350 NLP Projects With Code

The Most Powerful NLP-Weapon Arsenal
Himanshu Ramchandani
M.Tech | Data Science
NLP Migrant Workers' Paradise: an almost complete Chinese NLP resource library
While getting started with NLP and becoming familiar with it, I used many packages from GitHub, so I organized them and am sharing them here.
⭐
Many of the packages are interesting and worth bookmarking. If you find this list useful, please share and star it. Thanks!
❤️❤️❤️
Updated irregularly over the long term; feel free to watch and fork.
🍆🍒🍐🍊 🌻🍓🍈🍅🍍
* Corpus * Document Processing
* Thesaurus and lexical tools * Table Processing
* Pre-trained language model * Text Matching
* Extraction * Text Data Augmentation
* Knowledge graph * Text Retrieval
* Text generation * Reading Comprehension
* Text summarization * Sentiment Analysis
* Intelligent question answering * Common Regular Expressions
* Text error correction * Speech Processing
* Common regular expressions * Text visualization
* Event extraction * Text annotation tool
* Machine translation * Comprehensive tool
* Digital transformation * Fun tools
* Anaphora resolution * Course report interview, etc.
* Text clustering * Competition
* Text classification * Financial NLP
* Knowledge reasoning * Medical NLP
* Explainable NLP * Legal NLP
* Text adversarial attack * Text generation image
* Others
Corpus
Resource name (Name) Description Link
Corpus of names wainshine/Chinese-Names-Corpus
Chinese-Word-Vectors Various Chinese word vectors github repo
Chinese Chat Corpus: includes the Douban multi-turn corpus, PTT gossip corpus, Qingyun corpus, TV drama dialogue corpus, Tieba forum reply corpus, Weibo corpus, and Xiaohuangji ("little yellow chicken") corpus. link
Chinese rumor data: each line of the data file contains one rumor record in JSON format. github
Chinese Question Answering Dataset link (extraction code: 2dva)
WeChat official account corpus: a 3 GB corpus of articles crawled from WeChat official accounts, with HTML removed so that only plain text remains. One article per line, in JSON format: name is the name of the official account, account is its ID, title is the article title, and content is the body text. github
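The one-record-per-line JSON layout described for the rumor data and the WeChat corpus can be read with the standard library alone. The following is a minimal sketch, assuming the four field names given in the description (name, account, title, content); the sample lines and the function name `iter_articles` are invented for illustration and are not taken from the corpus.

```python
import json
from typing import Dict, Iterable, Iterator

# Field names as listed in the corpus description above.
FIELDS = ("name", "account", "title", "content")

def iter_articles(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per non-empty JSON line, keeping only the documented fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {field: record.get(field, "") for field in FIELDS}

# Hypothetical sample data in the documented layout (not real corpus content).
sample_lines = [
    '{"name": "Example Account", "account": "example_id", "title": "Hello", "content": "Plain text body."}',
    "",
    '{"name": "Another Account", "account": "another_id", "title": "World", "content": "More text."}',
]

articles = list(iter_articles(sample_lines))
print(len(articles), articles[0]["title"])
```

For a real file, passing an open file handle (for example `open(path, encoding="utf-8")`) instead of a list streams the corpus line by line, so a multi-gigabyte file never has to fit in memory at once.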
Chinese natural github

language processing
corpus, data set
Task-based dialogue English dataset: [The most complete task-based dialogue dataset] A comprehensive collection of task-based dialogue datasets, covering the main information for all commonly used datasets in the field. To help researchers follow progress in the field, state-of-the-art experimental results on several datasets are presented as a leaderboard. github
Speech Recognition Corpus Generation Tool: creates an Automatic Speech Recognition (ASR) corpus from online videos with audio/subtitles. github
LitBank NLP dataset A corpus of 100 labeled github

English novels supporting
natural language processing
and computational humanities
tasks
ChineseULMFiT Sentiment Analysis Text github

Classification Corpus and
Model
Administrative division data for provinces, municipalities and towns, annotated with pinyin github
Automated github
Summarization
Corpus of Education
Industry News
Chinese Natural github

Language Processing
Dataset
Baidu Zhizhi Q&A More than 5.8 million github

Corpus questions, 9.38 million
answers, 5800 classification
labels. Based on the question
and answer corpus, it can
support a variety of
applications, such as chat
question and answer, logic
mining
Wikipedia Massively 85 languages, 1620 language github

Parallel Text Corpus pairs, 135M contrasting
sentences
Ancient Poetry github repo

Thesaurus
more complete
ancient poetry
lexicon
Low memory loading of Wikipedia data: uses the new version of the nlp library to load the 17GB+ English Wiki corpus while occupying only 9MB of memory; traversal speed is 2-3 Gbit/s. github
Couplet data: more than 700,000 couplets. github
"Color Dictionary" github
dataset
42GB of JD github
Customer Service
Dialogue Data
(CSDD)
700,000 couplet data link
Username Blacklist github

List
Dependency parsing 40,000 high-quality labeled Homepage

corpus data
People's Daily github

Corpus Processing
Toolset
False news dataset github

fake news corpus
Poetry Quality github

Evaluation /
Fine-grained
Emotional Poetry
Corpus
Open tasks related to Dataset and current best github

Chinese natural results
language processing
Chinese abbreviation github

dataset
Chinese task Representative dataset - github
benchmarking benchmark (pretrained) model
- corpus - baseline - toolkit -
leaderboard
Chinese Rumor github

Database
CLUEDatasetSearch Chinese and English NLP github

datasets Search all Chinese
NLP datasets, with commonly
used English NLP datasets
attached
Multi-Document github
Summarization
Dataset
Politeness transfer task ("Make Everyone Courteous"): transforms impolite sentences into polite ones while preserving meaning, and provides a dataset with 1.39M+ instances. paper and code
Cantonese/English github
Conversational
Bilingual Corpus
List of Chinese NLP github

datasets
Nomenclature github
recognition data set
of person-like
names/place
names/organization
names
Chinese Language Includes representative github

Comprehension datasets & benchmark models
Benchmark & corpora & leaderboards
OpenCLaP Civil documents, criminal github

multi-domain open documents, Baidu
source Chinese Encyclopedia
pre-trained language
model warehouse
Chinese whole-word-masking BERT and two reading comprehension datasets github
DRCD dataset: released by Delta Research Center, Taiwan, China. It has the same form as SQuAD and is an extractive reading comprehension dataset based on Traditional Chinese.
CMRC 2018 dataset: Chinese machine reading comprehension data released by the Joint Laboratory of Harbin Institute of Technology and iFLYTEK. Given a question, the system must extract a fragment of the text as the answer, in the same form as SQuAD.
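Both datasets are described as following the SQuAD form, in which every answer is a span of the passage. The sketch below illustrates that idea with an invented record; the field names (`context`, `question`, `answers`, `text`, `answer_start`) follow SQuAD v1.1, and the assumption that DRCD and CMRC 2018 files use exactly the same keys should be checked against the actual downloads.

```python
from typing import Any, Dict

# Invented example; not taken from DRCD or CMRC 2018.
context = "Harbin is the capital of Heilongjiang province."
answer_text = "Heilongjiang"

record: Dict[str, Any] = {
    "context": context,
    "question": "Which province is Harbin the capital of?",
    # SQuAD-style answers store the answer text plus its character offset.
    "answers": [{"text": answer_text, "answer_start": context.index(answer_text)}],
}

def spans_are_consistent(rec: Dict[str, Any]) -> bool:
    """Return True if every answer text equals the context slice at its offset."""
    ctx = rec["context"]
    for ans in rec["answers"]:
        start = ans["answer_start"]
        if ctx[start:start + len(ans["text"])] != ans["text"]:
            return False
    return True

print(spans_are_consistent(record))
```

A check like this is a cheap sanity test after any preprocessing (for example Traditional/Simplified conversion) that might shift character offsets.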
Dakshina dataset Latin/native script parallel github

dataset for twelve South Asian
languages
OPUS-100 Multilingual (100 kinds) github
parallel corpus centered on
English
Chinese Reading github

Comprehension
Dataset
Chinese natural github

language processing
vector collection
Chinese Language Includes representative github

Comprehension datasets, benchmark
Benchmark (pretrained) models, corpora,
leaderboards
Large list of NLP github

datasets/benchmark
tasks
700,000 couplet data github
Parallel Corpus of The short chapters include github

Classical Chinese "The Analects of Confucius",
(Ancient "Mencius", "Zuo Zhuan" and
Chinese)-Modern other short ancient books,
Chinese which have been merged with
"Zi Zhi Tong Jian"
COLDataset, Chinese Offensive Language Detection Dataset: covers topics such as race, gender, and region; the data will be released after the paper is published. paper
Thesaurus and Lexical Tools
Resource name (Name) Description Link
textfilter Sensitive word observerss/textfilter

filtering in Chinese
and English
Name extraction Chinese (modern, cocoNLP

function ancient) names,
Japanese names,
Chinese surnames
and first names,
titles (big aunt,
little aunt, etc.),
English ->
Chinese name
(John Lee), idiom
dictionary
Chinese National People's github

Abbreviation Library Congress:
National People's
Congress; China:
People's Republic
of China;
Women's Tennis:
Women/n Tennis/n
Game/vn
Chinese character decomposition dictionary: lists how each Chinese character can be decomposed into components, with up to three decomposition methods (1), (2), (3). kfcd/chaizi
Lexical Sentiment Value: for example, Mountain spring water: 0.400704566541; Sufficient: 0.37006739587. rainarch/SentiBridge
Chinese thesaurus, dongxiexidian/Chinese

stop words,
sensitive words
python-pinyin Convert Chinese mozillazg/python-pinyin

characters to
Pinyin
zhtools Conversion skydark/nstools

between
Traditional and
Simplified Chinese
English simulation of Chinese pronunciation engine: say wo i ni #say: I love you. tinyfool/ChineseWithEnglish
chinese_dictionary Thesaurus, guotong1988/chinese_dictionary

antonym, negative
thesaurus
wordninja: splits English strings written without spaces into their component words. wordninja
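Splitting a space-free string into words can be framed as dynamic programming over split points: for each prefix, keep the cheapest way to cover it with known words. The sketch below is a toy illustration of that general idea with a tiny hand-made vocabulary and made-up costs; it is not wordninja's implementation or API.

```python
from math import inf
from typing import List

# Hypothetical vocabulary with invented costs (lower cost = more common word).
WORD_COST = {
    "deep": 1.0, "learning": 1.0, "is": 0.5, "fun": 1.0,
    "earn": 2.0, "in": 0.5, "g": 5.0, "l": 5.0,
}
MAX_WORD_LEN = max(len(word) for word in WORD_COST)

def split_words(text: str) -> List[str]:
    """Segment text into vocabulary words with minimal total cost."""
    n = len(text)
    best = [0.0] + [inf] * n  # best[i]: minimal cost of segmenting text[:i]
    back = [0] * (n + 1)      # back[i]: start index of the last word in text[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_WORD_LEN), i):
            cost = WORD_COST.get(text[j:i])
            if cost is not None and best[j] + cost < best[i]:
                best[i] = best[j] + cost
                back[i] = j
    if best[n] == inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    words = []
    i = n
    while i > 0:  # walk the back-pointers from the end of the string
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]

print(split_words("deeplearningisfun"))
```

With a real frequency list in place of `WORD_COST`, the same recurrence prefers common long words ("learning") over chains of rare short ones ("l" + "earn" + "in" + "g").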
Vocabulary data related to automobile brands and automobile parts
Thesaurus organized by THU: IT thesaurus, financial thesaurus, idiom thesaurus, place names, historical celebrity thesaurus, poetry thesaurus, medical thesaurus, diet thesaurus, legal thesaurus, automobile thesaurus, animal thesaurus. link
Crime Legal Terms and Classification Model: contains 856 crime knowledge graphs, crime prediction based on a training database of 2.8 million crime records, 13-type question classification, and a legal question answering function based on 200,000 legal question-answer pairs. github
Word segmentation corpus + code: Baidu network disk link (extraction code: pea6)
Chinese word keras link

segmentation + implementation
part-of-speech
tagging based on
Bi-LSTM + CRF
Chinese word link

segmentation and
part-of-speech
tagging based on
Universal
Transformer + CRF
Fast Neural Network java version

Word Segmentation
Package
chinese-xinhua: Xinhua dictionary database and API, including commonly used xiehouyu (two-part allegorical sayings), idioms, words, and Chinese characters. github
SpaCy Chinese Contains Parser, github

model NER, syntax tree
and other
functions. Some
English packages
use spacy's
English model. If
you want to adapt
to Chinese, you
may need to use
spacy's Chinese
model.
Chinese character github

data
Synonyms Chinese github

Synonym Toolkit
HarvestText: domain-adaptive text mining tools (new word discovery, sentiment analysis, entity linking, etc.). github
word2word Easy-to-use github
multilingual
word-word pair set
62
languages/3,564
multilingual pairs
Polyphone github
dictionary data and
codes
Chinese characters, github

words, idioms query
interface
103976 English (sql version, csv github

vocabulary packs version, Excel
version)
Big list of swear github

words in English
word pinyin data github
Number calling github

library in 186
languages
Large-scale name github

database of
countries around the
world
Chinese character Extract the github
feature extractor features of
(featurizer) Chinese
characters
(pronunciation
features, font
features) for deep
learning features
char_featurizer - github
Chinese character
feature extraction
tool
Python interface github

library of mecab, the
CJK word
segmentation library
g2pC context-based github

Chinese
pronunciation
automatic marking
module
ssc, Sound Shape Code: a Chinese character string similarity calculation method based on "phonetic code". version 1, version 2, blog/introduction
Acquisition of github
multiple
meanings/sense
items of Chinese
words and semantic
disambiguation of
specific sentences
based on the
encyclopedia
knowledge base
Tokenizer is a fast github

and customizable
text tokenization
library
Tokenizers State-of-the-art github

tokenizer with a
focus on
performance and
versatility
Realize text "face github

changing" through
synonym
replacement
token2index is a github
powerful lightweight
term index library
compatible with
PyTorch/Tensorflow
Traditional and github

Simplified
Conversion
Cantonese NLP github
Tools
domain dictionary Professional github

dictionary
knowledge base
covering 68 fields
with a total of 9.16
million words
Pre-trained language model & large model

Resource name (Name) Description Link
BMList Big Model Big List github
Chinese translation of bert link

papers
The slides of the original link

author of bert
Text Classification Practice github
bert tutorial text github

classification tutorial
Bert pytorch github

implementation
Bert pytorch github

implementation
BERT generates sentence github
vectors, BERT does text
classification and text
similarity calculation
Diagram of bert and ELMO github
BERT Pre-trained models github

and downstream
applications
Language/Knowledge github
Representation Tool BERT
& ERNIE
Using the gpt-2 language github

model in Kashgari
Facebook LAMA Probes for analyzing factual and github

commonsense knowledge contained
in pretrained language models.
Language model analysis, providing a
unified access interface for
Transformer-XL/BERT/ELMo/GPT
pre-trained language models
Chinese GPT2 training github

code
XLM: Facebook's cross-lingual pre-trained language model github
Massive Chinese github

pre-trained ALBERT model
Transformers 2.0: natural language processing pre-trained language models (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) with support for TensorFlow 2.0 and PyTorch; 8 architectures / 33 pre-trained models / 102 languages. github
8 papers sort out the github

progress and reflection of
BERT related models
French RoBERTa French RoBERTa pre-trained link

pre-trained language language model trained with 138GB
model corpus
Chinese pre-trained ELECTRA model: a Chinese model pre-trained with adversarial learning. github
albert-chinese-ner Use the pre-trained language model github

ALBERT to do Chinese NER
Open source pre-trained github

language model collection
Chinese ELECTRA github

pre-training model
Predicting Next Word with github

Transformers (BERT,
XLNet, Bart, Electra,
Roberta, XLM-Roberta)
(Model Comparison)
TensorFlow Hub New language models for 40+ link

languages (including Chinese)
UER Chinese pre-trained model github
warehouses based on different
corpora, encoders, and target tasks
(including BERT, GPT, ELMO, etc.)
Open source pre-trained github

language model collection
Multilingual sentence github

vector package
Language Model as a Language Model as a Service github

Service (LMaaS)
Open source language 20 billion parameters, currently the github

model GPT-NeoX-20B largest publicly accessible pre-trained
general autoregressive language
model
Chinese Science Literature Contains 396,209 meta-information github

Dataset (CSL) (titles, abstracts, keywords,
disciplines, categories) of papers in
Chinese core journals. The CSL
dataset can be used as a pre-training
corpus, and can also be used to
construct many NLP tasks, such as
text summarization (title prediction),
keyword generation, and text
classification.
Large model development github

artifact
Extraction
time extraction: integrated into the Python package cocoNLP; feel free to try it. java version, python version
Neural network relationship Chinese is not supported yet github

extraction pytorch
Bert-based named entity Chinese is not supported yet github

recognition pytorch
Keyword (Keyphrase) extraction github

package pke
BLINK: state-of-the-art entity linking library github
Named entity recognition github

implemented by BERT/CRF
Support batch parallel github

LatticeLSTM Chinese named
entity recognition
Building a Model for Medical Contains dictionaries and corpus github

Entity Recognition annotations, based on python
Pipeline entity and relation extraction based on TensorFlow and BERT: the solution to the information extraction task of the 2019 Language and Intelligence Technology Competition (Schema based Knowledge Extraction, SKE 2019). github
Chinese named entity github

recognition NeuroNER vs
BertNER
Chinese Named Entity github

Recognition Based on BERT
Chinese key phrase extraction github

tool
bert tensorflow version for Chinese github

named entity recognition
bert-Kashgari Kashgari, a keras-based github

encapsulation classification and
labeling framework, can build a
classification or sequence
labeling model in a few minutes
cocoNLP Extraction of information such as github

name, address, email address,
mobile phone number, mobile
phone attribution, etc., rake
phrase extraction algorithm.
Microsoft multilingual recognition package for numbers, units, and entities such as dates and times github
Baidu open source benchmark github

information extraction system
Chinese address word github

segmentation (identification and
extraction of address elements),
NER through sequence
annotation
Open Domain Text Knowledge github

Triple Extraction and
Knowledge Base Construction
Based on Dependency Syntax
Chinese keyword extraction github

method based on pre-training
model
chinese_keyphrase_extractor A tool for chinese keyphrase github

(CKPE) extraction A tool for quickly
extracting and identifying
keyphrases from natural
language text
Simple resume parser to extract github

key information from resumes
BERT-NER-Pytorch three github

different modes of BERT
Chinese NER experiments
Knowledge graph
Tsinghua University XLORE: Chinese-English cross-lingual encyclopedic knowledge graph built from Baidu, Chinese Wiki, and English Wiki. link
Automatic generation of github

document maps
Question answering system based on knowledge graph in the medical field github (this repo refers to github)
Chinese character github

relationship knowledge
map project
AmpliGraph Knowledge github

Graph Representation
Learning (Python) Library
Knowledge Graph
Concept Link Prediction
Chinese knowledge map github

materials, data and tools
Chinese Knowledge Extract triplet information and github

Graph Based on Baidu build a Chinese knowledge map
Encyclopedia
Zincbase Knowledge github
Graph Construction Toolkit
Question answering github

system based on
knowledge graph
Collation of knowledge github

map deep learning related
materials
Southeast University github

"Knowledge Graph"
graduate course (data)
Knowledge map car audio github

work project
"One Piece" Knowledge github

Graph
A dataset of 132 Covers common sense, city, link

knowledge graphs finance, agriculture, geography,
weather, social networking,
Internet of Things, medical care,
entertainment, life, business,
travel, science and education
Large-scale, structured, link

Chinese-English bilingual
COVID-19 Knowledge
Graph (COKG-19)
Event Triple Extraction github
Based on Dependency
Syntax and Semantic Role
Labeling
Abstract Knowledge The current scale is 500,000, github

Graph supporting the abstraction of
nominal entities, state
descriptions, and event actions
Large-scale Chinese knowledge graph data: 140 million entities github
Jiagu natural language Based on models such as github

processing tool BiLSTM, it provides functions
such as knowledge graph
relationship extraction, Chinese
word segmentation,
part-of-speech tagging, named
entity recognition, sentiment
analysis, new word discovery,
keyword text summarization,
text clustering, etc.
medical_NER - Chinese github

Medical Knowledge Graph
Named Entity Recognition
A large list of learning github

materials/datasets/tool
resources related to
knowledge graphs
LibKGE is a knowledge github
graph embedding library
for reproducible research
Military field knowledge graph question answering project based on MongoDB storage: covers 8 categories (aircraft, space equipment, etc.) and more than 100 subcategories, with a knowledge base of 5,800 military weapon items in total. The project does not use a graph database for storage. It uses jieba to analyze questions, identifies the entities they mention, and answers multiple types of questions with query templates. It mainly serves as a demo of how question answering is approached in industry. github
Jingdong Commodity github

Knowledge Graph
Chinese Relation github

Extraction Based on
Distant Supervision
Intelligent Question github

Answering System Based
on Medical Knowledge
Graph
BLINK: state-of-the-art entity linking library github
A small securities github
knowledge
graph/knowledge base
dstlr unstructured text github

scalable knowledge map
construction platform
Baidu Encyclopedia Using BERT-based fine-tuning github

character entry attribute and feature extraction methods
extraction for knowledge graphs
Data related to COVID-19: Chinese medical dialogue dataset covering COVID-19 and other types of pneumonia; open data sources from institutions such as Tsinghua University (COVID-19). github, github
DGL-KE Graph github

Embedding
Representation Learning
Algorithm
causality map method data
Causal Event Pairs Based link

on Multi-Domain Text
Datasets
text generation
Texar Toolkit for Text github
Generation and
Beyond
Prof. Ehud Reiter's Blog: Professor Wan Xiaojun of Peking University strongly recommends this blog, which offers in-depth discussion of and reflection on NLG technology, evaluation, and application. link
Large list of resources github

related to text generation
Open Domain Dialogue Generation and Its Practice in Microsoft Xiaoice: natural language generation gives machines the ability to create content automatically. link
Text Generation Control github
A large list of natural github

language generation related
resources
Evaluating Natural link

Language Generation with
BLEURT
Automatic couplet data and Code link

robots
700,000 couplet data
Automatically generate Generating github

comments comments based
on Hacker News
article titles using
Transformer codec
model
Natural language github

generation SQL statement
(English)
Natural Language github

Generation Resource
Collection
Benchmarking Chinese github

Generation Tasks
Topic-specific text github

generation/text
augmentation based on
GPT2
Encoding, Tokenization, github

and Implementation of a
Controlled and Efficient Text
Generation Methodology
TextFooler's adversarial text github

generation module for text
classification/inference
SimBERT BERT model github
based on UniLM
idea, integrating
retrieval and
generation
New word generation and sentence making: uses GPT-2 variants to generate non-existent words from scratch, together with their definitions and example sentences. github
Automatically generate github

multiple choice questions
from text
Synthetic Data Generation github

Benchmark
Text summarization
Resource name (Name) Description Link
Chinese text summarization/keyword extraction github
Automatic Summarization of Resume Based on github

Named Entity Recognition
Automatic text summarization library TextTeaser English github
only
Extractive summary extraction based on the latest github

language models such as BERT
A Comprehensive Guide to Text Summarization with link

Deep Learning in Python
(Colab) Abstractive text summarization implementation highlights (tutorial) github
Smart Q&A
Chinese chatbot Train the chatbot you want github

according to your own corpus,
which can be used in scenarios
such as intelligent customer
service, online question and
answer, intelligent chat, etc.
Interesting robot qingyun Chinese chatbot trained by github

qingyun
Open dialogue robots, github

knowledge graphs, semantic
understanding, natural language
processing tools and data
QA pair robot: Amodel-for-Retrivalchatbot, a customer service robot and Chinese retrieval chatbot. git
ConvLab open source github

multi-domain end-to-end
dialogue system platform
A dialog system based on the github

latest version of rasa
Chatbots based on the github

financial-judicial domain (with
the nature of small talk)
End-to-end closed-domain github

dialogue system
MiningZhiDaoQACorpus 5.8 million Baidu Zhizhi Q&A github

data mining project, Baidu Zhizhi
Q&A corpus, including more than
5.8 million questions, each with a
question label. Based on this
question and answer corpus, it
can support a variety of
applications, such as logic
mining
GPT2 model GPT2-chitchat for github

Chinese chatting
Selection of relevant resource github

lists (Leaderboards, Datasets,
Papers) based on multiple
rounds of responses from
retrieval chatbots
Microsoft Conversational Bot github

Framework
chatbot-list Application and architecture of github

intelligent customer service and
chatbots, algorithm sharing and
introduction in the industry
Chinese medical dialogue data github

Chinese medical dialogue data
set
A Large-Scale Medical Dialogue Contains 1.1 million medical github

Dataset consultations and 4 million
doctor-patient dialogues
Large-scale cross-domain paper

Chinese task-oriented & data
multi-round dialogue dataset
and model CrossWOZ
Open source conversational github

information search platform
Contextual Interaction github

Multimodal Dialogue Challenge
2020 (DSTC9 2020)
T5 model trained on Quora questions for question paraphrasing (Paraphrase) github
Google releases Taskmaster-2 github
natural language task dialogue
dataset
Haystack's flexible, powerful, github

and extensible Question
Answering (QA) framework
End-to-end closed-domain github

dialogue system
Amazon releases github

knowledge-based
human-human open domain
dialogue dataset
Albert Large QA model trained github

based on Baidu webqa and
dureader dataset
CommonsenseQA link
Commonsense-Oriented
English QA Challenge
MedQuAD (English) Medical github

Question Answering Dataset
A Q&A engine using Wikipedia github

text as context, based on Albert
and Electra
A question answering attempt based on a knowledge base of 140,000 songs: functions include a lyrics chain game, finding songs from known lyrics, and question answering about the song-artist-lyrics triangular relationship. github
text error correction

Chinese text error correction github

module code
English spell checking library github
Python spell checking library github
GitHub Typo Corpus Large-Scale github

GitHub Multilingual
Spelling/Grammar Error Dataset
BertPunc BERT-based github

state-of-the-art punctuation repair
model
Chinese writing proofreading tool github
Text Error Correction Literature List Chinese Spell Checking github

(CSC) and Grammatical Error
Correction (GEC)
Winner of the Text Smart Proofreading Contest: already deployed in applications; from a team of Soochow University and DAMO Academy. link
multimodal
(Name)
Chinese Multimodal Dataset "Wukong": a large-scale dataset open-sourced by Huawei's Noah's Ark Laboratory, including 100 million image-text pairs. github
Chinese graphic The Chinese version of the CLIP github

representation pre-training model, open source multiple
pre-training model model scales, and a few lines of code can
Chinese-CLIP handle Chinese image-text representation
extraction & image-text retrieval
speech processing
ASR Speech Dataset + github

Chinese Speech
Recognition System Based
on Deep Learning
Tsinghua University THCHS30 Chinese Speech Dataset
Links:
* data_thchs30.tgz (OpenSLR domestic mirror)
* data_thchs30.tgz
* test-noise.tgz (OpenSLR domestic mirror)
* test-noise.tgz
* resource.tgz (OpenSLR domestic mirror)
* resource.tgz
* Free ST Chinese Mandarin Corpus
* Free ST Chinese Mandarin Corpus
* AIShell-1 open source version dataset (OpenSLR domestic mirror)
* AIShell-1 open source version dataset
* Primewords Chinese Corpus Set 1 (OpenSLR domestic mirror)
* Primewords Chinese Corpus Set 1
laughter detector github

Common Voice Speech Includes over 1,400 link
Recognition Dataset New hours of speech
Version samples from 42,000
contributors, covering
github
speech-aligner A tool for generating github

phoneme-level
time-aligned
annotations from
"human voice speech"
and its "language text"
ASR Speech github

Dictionary/Dictionary
Speech Sentiment Analysis github
masr Chinese speech github

recognition, providing
pre-training model, high
recognition rate
Chinese Text Normalization github

for Speech Recognition
Voice quality evaluation github

indicators (MOSNet,
BSSEval, STOI, PESQ,
SRMR)
Chinese/English github
Pronunciation Dictionary
for Speech Recognition
CoVoST: multilingual speech-to-text translation corpus released by Facebook. Includes audio, text transcription, and English translation in 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian and Chinese). github
Parakeet text-to-speech github

synthesis based on
PaddlePaddle
(Java) Accurate Speech github

Natural Language
Detection Library
Text-to-Speech Synthesis github

Implemented in TensorFlow
2
Python audio feature github

extraction package
ViSQOL: an objective, full-reference metric for perceived audio quality, with two modes: audio and speech. github
zhrtvc Easy-to-use Chinese github
voice clone and
Chinese speech
synthesis system
aukit An easy-to-use speech github

processing toolbox,
including speech noise
reduction, audio format
conversion, feature
spectrum generation
and other modules
phkit An easy-to-use github

phoneme processing
toolbox, including
Chinese phonemes,
English phonemes,
text-to-pinyin, text
regularization and other
modules
zhvoice Chinese speech corpus, github

the speech is clearer
and more natural,
including 8 open source
data sets, 3200
speakers, 900 hours of
speech, 13 million
words
audio: audio annotation tools for speech activity detection, diarization, speaker recognition, automatic speech recognition, emotion recognition, and related tasks. github
Deep Learning Emotional github
Text-to-Speech Synthesis
Python audio data github

augmentation library
Audio Enhancement Based github

on Large-Scale Audio
Dataset Audioset
voice transfer github
document processing
(Name)
LayoutLM-v3 github
Document
Understanding
Model
PyLaia Deep github

Learning Toolkit
for Handwritten
Document
Analysis
Single-document github
unsupervised
keyword
extraction
DocSearch Free github
Documentation
Search Engine
fdfgen Ability to automatically create pdf link

documents and fill in information
pdfx Automatically extract cited references link

and download the corresponding pdf file
invoice2data: invoice PDF information extraction. invoice2data
PDF document github

information
extraction
PDFMiner PDFMiner can get the exact position of link

the text in the page, as well as other
information such as font or line. It also
has a PDF converter that can convert
PDF files to other text formats such as
HTML. There is also an extensible PDF parser that can be used for
purposes other than text analysis.
PyPDF2 PyPDF 2 is a python PDF library capable link

of splitting, merging, cropping and
converting pages of PDF files. It can also
add custom data, viewing options and
passwords to PDF files. It can retrieve
text and metadata from PDFs, and can
also merge entire files together.
ReportLab ReportLab can quickly create PDF link

documents. A time-proven,
super-easy-to-use open source project
for creating complex, data-driven PDF
documents and custom vector graphics.
It's free, open source, and written in
Python. With more than 50,000
downloads per month, the package is
part of standard Linux distributions,
embedded in many products, and was
chosen to power Wikipedia's print/export
functionality.
SIMPdf: a simple PDF file text editor written in Python github
pdf-diff PDF file diff tool can display the github

difference between two pdf documents
Table processing
Use unet to realize github

automatic detection of
document tables and table
reconstruction
pdftabextract: analyzes table information after OCR recognition; very powerful. link
tabula-py: converts table information in PDF files directly into pandas DataFrames; versions exist in both Java and Python.
camelot: PDF table parsing. link
pdfplumber: PDF table parsing.
PubLayNet Able to divide paragraphs, link

identify tables, pictures
Extract tabular data from github

papers
Finding answers in tables github

with BERT
Series of articles on table Introduction to

questions and answers the end of the
model
Generate tabular data github

using GAN (English only)
carefree-learn (PyTorch) Automated Machine Learning github
(AutoML) Package for Tabular
Datasets
Closed domain fine-tuning github

table detection
PDF form data extraction github

tool
TaBERT A New Model for paper

Understanding Tabular
Data Queries
Table processing: Awesome-Table-Recognition github
Text matching
MatchZoo (sentence and QA similarity matching) A collection of text similarity matching algorithms, including multiple deep learning methods; worth trying. github
Chinese Question Sentence Similarity Calculation Competition and Scheme Summary github
similarity (similarity calculation toolkit) Written in Java; used for similarity calculations on words, phrases, and sentences, as well as lexical analysis, sentiment analysis, semantic analysis, etc. github
Chinese word similarity calculation method Combines the word similarity calculation methods of Synonyms Cilin Extended Edition and HowNet, giving broader vocabulary coverage and more accurate results. github
Python string similarity algorithm library github
Similar sentence judgment model based on Siamese BiLSTM Provides a training data set of 100,000 samples and a test data set github
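To make the idea behind the string similarity entries above concrete, here is a minimal sketch using only the Python standard library. It is not the API of any library listed in this section; the function names `levenshtein`, `char_jaccard`, and `sequence_ratio` are hypothetical names chosen for illustration.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance: the minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over character sets; needs no word segmentation,
    so it can be applied to Chinese text directly."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def sequence_ratio(a: str, b: str) -> float:
    """Similarity in [0, 1] from difflib's matching-block heuristic."""
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    print(levenshtein("今天天气很好", "今天天气不好"))
    print(char_jaccard("abc", "bcd"))
    print(sequence_ratio("similarity", "similarly"))
```

Character-level measures like these are cheap baselines; the deep learning matchers listed above target semantic similarity, which these functions do not capture.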
Text Data Augmentation

Chinese NLP Data Augmentation (EDA) Tool github
English NLP data augmentation tool github
One-click Chinese data augmentation tool github
The application and effect of data augmentation in machine translation and other NLP tasks link
NLP Data Augmentation Resource Collection github
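EDA (Easy Data Augmentation) combines four simple token-level operations: synonym replacement, random insertion, random swap, and random deletion. The sketch below shows only the two operations that need no synonym dictionary. It assumes the text has already been tokenized (for Chinese, by a word segmenter), and the function names are hypothetical; this is not the code of the tools listed above.

```python
import random
from typing import List


def random_swap(tokens: List[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Swap two randomly chosen positions, n_swaps times."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens: List[str], p: float, rng: random.Random) -> List[str]:
    """Delete each token independently with probability p,
    but always keep at least one token."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    sentence = ["今天", "天气", "很", "好"]
    print(random_swap(sentence, n_swaps=1, rng=rng))
    print(random_deletion(sentence, p=0.3, rng=rng))
```

Both functions return new lists and leave the input untouched, so one source sentence can be augmented several times.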

Common regular expressions
(Name)
Regular expression to extract email Integrated into the Python package cocoNLP; give it a try
Extract phone_number Integrated into the Python package cocoNLP; give it a try
Regular expression for extracting ID number
IDCards_pattern = r'^([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])'
IDs = re.findall(IDCards_pattern, text, flags=0)
IP address regular expression
(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)
Tencent QQ number regular expression [1-9]([0-9]{5,11})
Domestic fixed-line number regular expression [0-9-()（）]{7,18}
Username regex [A-Za-z0-9_\-\u4e00-\u9fa5]+
Regular matching of domestic phone numbers (three major operators + virtual, etc.) github
Regular Expression Tutorial github
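The patterns in this table can be tried with Python's `re` module. The sketch below adapts them for extraction from running text, and the adaptations are assumptions rather than the original author's code: the `^` anchor of the ID pattern is replaced by lookarounds, inner groups are made non-capturing so that `re.findall` returns plain strings instead of tuples, the dots in the IP pattern are escaped, and the hyphen in the username class is escaped so that it cannot be read as a range starting at `_`. The sample ID number is made up, and no checksum is verified.

```python
import re

# 18-character ID number: 6-digit region, 8-digit birth date, 3-digit
# sequence, and a final check character (digit or x/X).
ID_CARD = re.compile(
    r"(?<!\d)[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])"
    r"(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX](?![0-9xX])"
)

# One IPv4 octet, taken from the table; four of them joined by literal dots.
OCTET = r"(?:25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)"
IPV4 = re.compile(rf"(?<![\d.]){OCTET}(?:\.{OCTET}){{3}}(?![\d.])")

# Letters, digits, underscore, hyphen, and common CJK ideographs.
USERNAME = re.compile(r"[A-Za-z0-9_\-\u4e00-\u9fa5]+")

if __name__ == "__main__":
    text = "身份证号：110101199003078515，服务器 192.168.1.10"
    print(ID_CARD.findall(text))
    print(IPV4.findall(text))
    print(bool(USERNAME.fullmatch("张三_dev-01")))
```

For validating a single form field rather than searching a document, `fullmatch` on the bare pattern is usually the simpler choice.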
text search
Efficient Fuzzy Search Tool github
Large list/search engine of BERT models for various languages/tasks link
DeepMatch Deep matching model library for recommendation, advertising, and search github
wwsearch A full-text search engine developed by the enterprise WeChat backend team github
aili The fastest in-memory index in the East github
RapidFuzz (efficient string matching tool) A fast string matching library for Python and C++ that uses the string similarity calculations from FuzzyWuzzy github
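As a rough illustration of what fuzzy string search does, the sketch below uses `difflib` from the Python standard library as a stand-in. It is not RapidFuzz's API and does not use its algorithms; the helper name `fuzzy_search` and the sample choices are hypothetical.

```python
from difflib import SequenceMatcher, get_close_matches
from typing import List, Tuple


def fuzzy_search(query: str, choices: List[str],
                 limit: int = 3, cutoff: float = 0.6) -> List[Tuple[str, float]]:
    """Return up to `limit` (choice, score) pairs whose similarity to
    `query` is at least `cutoff`, best match first."""
    matches = get_close_matches(query, choices, n=limit, cutoff=cutoff)
    return [(m, SequenceMatcher(None, m, query).ratio()) for m in matches]


if __name__ == "__main__":
    packages = ["rapidfuzz", "fuzzywuzzy", "jieba", "hanlp"]
    for name, score in fuzzy_search("rapidfuz", packages):
        print(f"{name}\t{score:.3f}")
```

A dedicated library is the better choice for large candidate lists; this version compares the query against every choice in pure Python.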
reading comprehension
allennlp reading comprehension Supports a variety of data and models github
Sentiment analysis
Aspect sentiment analysis package github
awesome-nlp-sentiment-analysis Sentiment analysis, emotion cause identification, and extraction of evaluation objects and evaluation words github
Sentiment analysis technology enables intelligent customer service to better understand human emotions github
event extraction
Chinese event extraction github
List of Literature Resources for NLP Event Extraction github
BERT event extraction implemented in PyTorch (ACE 2005 corpus) github
News Event Clue Extraction github
machine translation
Resource name (Name) Description Link
Wudao Dictionary (wudao-dict) A command-line version of Youdao Dictionary that supports English-Chinese mutual lookup and online search github
NLLB A language model that supports translation between any pair of 200+ languages link
Easy-Translate A script for translating large text files locally, based on Facebook/Meta AI's M2M100 and NLLB200 models; supports 200+ languages github
Number conversion
The best conversion tool between Chinese numerals and Arabic numerals github
Quickly convert between "Chinese numerals" and "Arabic numerals" github
Parse and convert natural language numeric strings to integers and floating point numbers github
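The core of Chinese numeral conversion is a single left-to-right pass that tracks the current digit, the current section (below 万), and the running total. The sketch below is an illustration written for this list, not the code of any repository above. It assumes formal integer numerals such as 二百零五 and does not handle colloquial forms, decimals, or financial (capital) numerals; all names are hypothetical.

```python
DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
BIG_UNITS = {"万": 10_000, "亿": 100_000_000}


def chinese_to_int(text: str) -> int:
    """Convert a formal Chinese integer numeral to an int."""
    if not text:
        raise ValueError("empty numeral")
    total = 0    # value of completed 万 / 亿 sections
    section = 0  # value accumulated since the last big unit
    digit = 0    # most recent digit, still waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as the 十 in 十五 counts as one ten.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in BIG_UNITS:
            section += digit
            if ch == "亿":
                # 亿 scales everything read so far (e.g. 一万亿).
                total = (total + section) * BIG_UNITS[ch]
            else:
                total += section * BIG_UNITS[ch]
            section = digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


if __name__ == "__main__":
    for numeral in ["十五", "二百零五", "三万五千", "一亿二千万"]:
        print(numeral, chinese_to_int(numeral))
```

The dedicated tools listed above cover far more input variants; this version only shows the unit-accumulation idea.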
anaphora resolution
Chinese coreference resolution data github, Baidu link (extraction code a0qq)
text clustering
TextCluster Short text clustering preprocessing module (short text cluster) github
Text Categorization
NeuralNLP-NeuralClassifier Tencent's open source deep learning text classification tool github
knowledge reasoning
Graphbrain An open source AI software library and research tool designed to facilitate automatic meaning extraction and text understanding, as well as knowledge exploration and inference github
(Harvard) Free book on causal reasoning pdf

Interpretable Natural Language Processing
State-of-the-art explainer library for textual machine learning models github
text attack
TextAttack Adversarial attack framework for natural language processing models github
OpenBackdoor (text backdoor attack and defense toolkit) Developed in Python and PyTorch; can be used to reproduce, evaluate, and develop algorithms for text backdoor attack and defense github
text visualization
Scattertext Text visualization (Python) github
whatlies Interactive word vector visualization spacy tools
PySS3 Visualization tool for SS3 text classifiers, aimed at explainable AI github
Render 3D images with Notepad github
attnvis Interactive attention visualization for GPT-2, BERT, and other transformer language models github
Texthero (efficient text data processing package) Includes preprocessing, keyword extraction, named entity recognition, vector space analysis, text visualization, etc. github
text annotation tool

Overview of NLP annotation platforms github
brat rapid annotation tool Sequence annotation tool link
Poplar Web-based natural language annotation tool github
LIDA A lightweight interactive dialogue annotation tool github
doccano A web-based, open source, collaborative multilingual text annotation tool github
Datasaur.ai Online data labeling workflow management tool link
language detection
Resource name (Name) Description Link
langid Detects 97 languages https://github.com/saffsd/langid.py
langdetect Language detection https://code.google.com/archive/p/language-detection/
comprehensive tool
(Name)
jieba jieba
hanlp hanlp
nlp4han Chinese natural language processing tool set (sentence segmentation/word segmentation/part-of-speech tagging/chunking/syntactic analysis/semantic analysis/NER/N-gram/HMM/pronoun resolution/sentiment analysis/spelling check) github
Progress in Hate Speech Detection link
BERT applications based on PyTorch Including named entity recognition, sentiment analysis, text classification, text similarity, etc. github
Some basic models of natural language github
Template code for sequence tagging and text classification with BERT github
jieba_fast Accelerated version of jieba github
Stanford NLP Pure Python version of the natural language processing package link
Python Spoken Natural Language Processing Toolset (English) github
PreNLP Natural language preprocessing library github
Some papers and code related to NLP Including topic models, word vectors (word embedding), named entity recognition (NER), text classification, text generation, text similarity calculation, etc., covering various NLP-related algorithms, based on Keras and TensorFlow github
Python text mining/NLP practical examples github
Forte A flexible and powerful natural language processing pipeline toolset github
stanza (Stanford team NLP tools) Can handle more than sixty languages github
Fancy-NLP A text knowledge mining tool for building product portraits github
Comprehensive and easy-to-use Chinese NLP toolkit github
Reproduction of vectorized recall pipelines commonly used in the industry, based on DSSM github
Texthero (efficient text data processing package) Includes preprocessing, keyword extraction, named entity recognition, vector space analysis, text visualization, etc. github
nlpgnn Graph neural network natural language processing toolbox github
Macadam A natural language processing toolkit based on TensorFlow (Keras) and bert4keras, focusing on text classification, sequence labeling, and relation extraction github
LineFlow An efficient NLP data loader for all deep learning frameworks github
Arabica Python toolkit for exploratory analysis of text data github
Python stress testing tool: SMSBoom github
funny tool
(Name)
Wang Feng Lyric Generator phunterlau/wangfeng-rnn
Analysis of girlfriend's emotional fluctuations github
NLP is too difficult series github
Variable naming artifact github link
Image text removal, can be used for manga translation github
CoupletAI (couplet generation) Automatic couplet system based on CNN+Bi-LSTM+Attention github
Solving Complex Mathematical Equations Using Neural Network Symbolic Reasoning github
Question answering robot based on a knowledge base of 140,000 songs Functions include lyrics solitaire, finding songs from known lyrics, and question answering about the triangular relationship among songs, artists, and lyrics github
COPE Metrical Poem Editor github
Paper2GUI An AI desktop app toolbox for ordinary people. It can be used immediately without installation and already supports 18+ AI models, covering speech synthesis, video frame interpolation, video super-resolution, object detection, image stylization, OCR, etc. github
Politeness estimator (trained using Sina Weibo data) github paper
Grass Python (Chinese version of Python) getting started guide Chinese programming language homepage gitee
course report interview

(Name)
Natural Language Processing Report link
Knowledge Graph Report link
Data Mining Report link
Autonomous Driving Report link
Machine Translation Report link
Blockchain Report link
Robot Report link
Computer Graphics Report link
3D Printing Report link
Facial Recognition Report link
Artificial Intelligence Chip Report link
cs224n deep learning natural language processing course PyTorch implementation of the models in the course link, course link
Natural Language Processing by Example Tutorial for Deep Learning Researchers
"Natural Language Processing" by Jacob Eisenstein github
ML-NLP Knowledge points and code implementations often tested in machine learning and NLP interviews github
NLP task example project code set github
2019 NLP Highlights Review download
nlp-recipes Best practices and examples of natural language processing, produced by Microsoft github

Transfer Learning in Natural Language Processing (NLP) youtube
Machine Learning Systems book link github
Contest
Review of the top solutions of all NLP competitions github
2019 Baidu Triple Extraction Competition, "Scientific Space Team" source code (7th place) github
Financial Natural Language Processing

BDCI2019 Financial Negative Information Judgment github
Open source financial investment data extraction tool github
A large list of natural language processing research resources in the financial field github
Chatbots based on the financial-judicial domain (with the nature of small talk) github
Demonstration of small-scale financial knowledge graph construction process github
Medical Natural Language Processing

(Name)
Chinese medical NLP public resources collection github
spaCy Medical Text Mining and Information Extraction github
Building a Model for Medical Entity Recognition Contains dictionaries and corpus annotations; based on Python github
Question answering system based on a knowledge graph in the medical field github (this repo refers to github)
Chinese medical dialogue data Chinese medical dialogue data set github
A Large-Scale Medical Dialogue Dataset Contains 1.1 million medical consultations and 4 million doctor-patient dialogues github
Data related to COVID-19 pneumonia COVID-19 and other types of Chinese medical dialogue datasets; open COVID-19 data sources from institutions such as Tsinghua University github github
Legal Natural Language Processing

(Name)
Blackstone A spaCy pipeline and NLP model for unstructured legal text github
List of Forensic Intelligence Literature Resources github
Chatbots based on the financial-judicial domain (with the nature of small talk) github
Crime Legal Terms and Classification Model Contains a knowledge graph of 856 crimes, crime prediction trained on a database of 2.8 million crimes, 13-category question classification, and legal question answering based on 200,000 legal question-and-answer pairs github
text to image
(Name)
Dalle-mini A mini version of DALL·E that generates pictures based on text prompts github
other
phone China mobile phone attribution query ls0f/phone
phone International mobile phone and telephone attribution query AfterShip/phone
ngender Infers gender based on name observers/ngender
A summary of the differences between Chinese and English NLP link
Technical documents (PDF or PPT) shared by senior engineers inside major companies github
comparxiv Compares the differences between two submitted versions of a paper on arXiv pypi
CHAMELEON Meta-architecture of a deep learning news recommendation system github
Automatic Resume Screening System github
A variety of text readability evaluation metrics implemented in Python github
Data Science ML Full Stack Roadmap

https://github.com/hemansnation/Data-Science-ML-Full-Stack-2022
Join the Data Science & ML Full Stack WhatsApp Group Community here:
If the group is full, please join another one.
https://chat.whatsapp.com/B7Mdp6QTMJ0KZYGWrziT3Y
https://chat.whatsapp.com/HWDSJU4KXrXJIcn5Npp3Gm
https://chat.whatsapp.com/DmATV5uaVY7IKrTMHDiHnr
https://chat.whatsapp.com/Blz2n8QYSgdKWfQbJZxHtJ
Join Telegram for Data Science ML AI Resources:

https://t.me/+sREuRiFssMo4YWJl
Join Community on LinkedIn:
https://www.linkedin.com/groups/12540639/
Connect with me on these platforms:

LinkedIn: https://www.linkedin.com/in/hemansnation/
Twitter: https://twitter.com/hemansnation
GitHub: https://github.com/hemansnation
Instagram: https://www.instagram.com/masterdexter.ai/
Are you a professional?

DM for One-on-One sessions for Python, Data Science, Machine Learning,
and Data Engineering.
Here: https://bit.ly/3U6zQvQ
Python Notion Template

https://hemansnation.gumroad.com/l/god-level-python-with-himanshu-ramchandani
