350 NLP Projects With Code


The Most Powerful NLP-Weapon Arsenal

Himanshu Ramchandani
M.Tech | Data Science
NLP Migrant Workers' Paradise: Almost the most complete
Chinese NLP resource library
While getting started with NLP and becoming familiar with it, I used many packages
from GitHub, so I organized them and am sharing them here.


Many of the packages are interesting and worth bookmarking, enough to satisfy any
collector's habit! If you find this list useful, please share and star. Thanks!

❤️❤️❤️
Updated irregularly over the long term; you are welcome to watch and fork!

🍆🍒🍐🍊 🌻🍓🍈🍅🍍
* Corpus * Document Processing
* Thesaurus and lexical tools * Table Processing
* Pre-trained language model * Text Matching
* Extraction * Text Data Augmentation
* Knowledge graph * Text Retrieval
* Text generation * Reading Comprehension
* Text summarization * Sentiment Analysis
* Intelligent question answering * Common Regular Expressions
* Text error correction * Speech Processing
* Common regular expressions * Text visualization
* Event extraction * Text annotation tool
* Machine translation * Comprehensive tool
* Digital transformation * Fun and interesting tools
* Anaphora resolution * Courses, reports, interviews, etc.
* Text clustering * Competition
* Text classification * Financial NLP
* Knowledge reasoning * Medical NLP
* Explainable NLP * Legal NLP
* Text adversarial attack * Text-to-image generation
* Others

Corpus
Resource name (Name) | Description | Link
Corpus of names | | wainshine/Chinese-Names-Corpus
Chinese-Word-Vectors | Various Chinese word vectors | github repo
Chinese Chat Corpus | Includes the Douban multi-turn corpus, PTT gossip corpus, Qingyun corpus, TV drama dialogue corpus, Tieba forum reply corpus, Weibo corpus, and Xiaohuangji ("little yellow chicken") corpus | link
Chinese rumor data | Each line of the data file contains one rumor in JSON format | github
Chinese Question Answering Dataset | | link (extraction code: 2dva)
WeChat official account corpus | A 3 GB corpus of articles crawled from WeChat official accounts, with HTML removed so that only plain text remains. One article per line, in JSON format: name is the name of the official account, account is its ID, title is the article title, and content is the body text | github
Chinese natural language processing corpus, data set | | github
Task-based dialogue English dataset | [The most complete task-based dialogue dataset] A comprehensive collection covering the main information of all commonly used datasets in task-based dialogue. To help researchers follow progress in the field, state-of-the-art experimental results on several datasets are presented as a leaderboard | github

Speech Recognition Corpus Generation Tool | Creates an Automatic Speech Recognition (ASR) corpus from online videos with audio and subtitles | github
LitBank NLP dataset | A corpus of 100 labeled English novels supporting natural language processing and computational humanities tasks | github
ChineseULMFiT | Sentiment analysis text classification corpus and model | github
Administrative division data of provinces, municipalities, and towns, annotated with pinyin | | github
Automated Summarization Corpus of Education Industry News | | github
Chinese Natural Language Processing Dataset | | github
Baidu Zhidao Q&A Corpus | More than 5.8 million questions, 9.38 million answers, and 5,800 classification labels. The corpus can support a variety of applications, such as chat question answering and logic mining | github
Wikipedia Massively Parallel Text Corpus | 85 languages, 1,620 language pairs, 135M parallel sentences | github
Ancient Poetry Thesaurus | | github repo; more complete ancient poetry lexicon

Low memory loading of Wikipedia data | Uses the new version of the nlp library to load the 17 GB+ English Wikipedia corpus while occupying only 9 MB of memory, with a traversal speed of 2-3 Gbit/s | github
Couplet data | More than 700,000 couplets | github
"Color Dictionary" dataset | | github
42 GB of JD Customer Service Dialogue Data (CSDD) | | github
700,000 couplet data | | link
Username Blacklist List | | github
Dependency parsing corpus | 40,000 high-quality labeled data items | Homepage
People's Daily Corpus Processing Toolset | | github
Fake news dataset (fake news corpus) | | github
Poetry Quality Evaluation / Fine-grained Emotional Poetry Corpus | | github
Open tasks related to Chinese natural language processing | Datasets and current best results | github
Chinese abbreviation dataset | | github
Chinese task benchmarking | Representative dataset - benchmark (pretrained) model - corpus - baseline - toolkit - leaderboard | github

Chinese Rumor Database | | github
CLUEDatasetSearch | Chinese and English NLP datasets: search all Chinese NLP datasets, with commonly used English NLP datasets attached | github
Multi-Document Summarization Dataset | | github
"Make Everyone Courteous" Politeness Transfer Task | Transforms impolite sentences into polite ones while preserving meaning; provides a dataset with 1.39M+ instances | paper and code
Cantonese/English Conversational Bilingual Corpus | | github
List of Chinese NLP datasets | | github
Named entity recognition dataset for person names, place names, and organization names | | github
Chinese Language Understanding Benchmark | Includes representative datasets, benchmark models, corpora, and leaderboards | github
OpenCLaP multi-domain open source Chinese pre-trained language model repository | Civil documents, criminal documents, Baidu Encyclopedia | github

Chinese whole word masking BERT and two reading comprehension datasets | DRCD dataset: released by the Delta Research Institute of Taiwan, China; it has the same form as SQuAD and is an extractive reading comprehension dataset based on traditional Chinese. CMRC 2018 dataset: Chinese machine reading comprehension data released by the Joint Laboratory of HIT and iFLYTEK Research (HFL). Given a question, the system must extract spans from the text as answers, in the same form as SQuAD | github
Dakshina dataset | Latin/native script parallel dataset for twelve South Asian languages | github
OPUS-100 | English-centric multilingual parallel corpus covering 100 languages | github
Chinese Reading Comprehension Dataset | | github
Chinese natural language processing vector collection | | github
Chinese Language Understanding Benchmark | Includes representative datasets, benchmark (pretrained) models, corpora, and leaderboards | github
Large list of NLP datasets/benchmark tasks | | github

700,000 couplet data | | github
Parallel Corpus of Classical Chinese (Ancient Chinese) and Modern Chinese | The short texts include "The Analects of Confucius", "Mencius", "Zuo Zhuan", and other short ancient books; these have been merged with "Zi Zhi Tong Jian" | github
COLDataset, Chinese Offensive Language Detection Dataset | Covers topics such as race, gender, and region; the data will be released after the paper is published | paper
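
Two of the corpora above (the Chinese rumor data and the WeChat official account corpus) are described as one JSON object per line. A minimal sketch of reading such a file is shown here. The helper name `iter_articles` and the two sample records are made up for illustration; only the field names `name`, `account`, `title`, and `content` come from the description above, and the real files may differ.

```python
import io
import json


def iter_articles(lines):
    """Yield one dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        yield json.loads(line)


# Hypothetical sample standing in for the real corpus file.
sample = io.StringIO(
    '{"name": "Example Account", "account": "example_id", '
    '"title": "Hello", "content": "First article."}\n'
    '{"name": "Example Account", "account": "example_id", '
    '"title": "World", "content": "Second article."}\n'
)

articles = list(iter_articles(sample))
titles = [article["title"] for article in articles]
print(titles)
```

For a real corpus, passing an open file object (for example `open(path, encoding="utf-8")`) instead of the `io.StringIO` sample keeps memory use low, because records are parsed one line at a time.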

Thesaurus and Lexical Tools
Resource name (Name) | Description | Link
textfilter | Sensitive word filtering in Chinese and English | observerss/textfilter
Name extraction function | Chinese (modern, ancient) names, Japanese names, Chinese surnames and given names, kinship titles (big aunt, little aunt, etc.), English-to-Chinese names (John Lee), idiom dictionary | cocoNLP
Chinese Abbreviation Library | Example entries: National People's Congress: National People's Congress; China: People's Republic of China; Women's Tennis: Women/n Tennis/n Game/vn | github
Chinese character decomposition dictionary | Chinese character, decomposition (1), decomposition (2), decomposition (3) | kfcd/chaizi
Lexical Sentiment Value | Example entries: Mountain spring water: 0.400704566541; Sufficient: 0.37006739587 | rainarch/SentiBridge
Chinese thesaurus, stop words, sensitive words | | dongxiexidian/Chinese
python-pinyin | Converts Chinese characters to pinyin | mozillazg/python-pinyin
zhtools | Conversion between Traditional and Simplified Chinese | skydark/nstools
English simulation of Chinese pronunciation engine | say wo i ni #say: I love you | tinyfool/ChineseWithEnglish
chinese_dictionary | Thesaurus, antonym thesaurus, negation thesaurus | guotong1988/chinese_dictionary
wordninja | Splits English strings written without spaces into words | wordninja
Vocabulary related to automobile brands and automobile parts | | data

Thesaurus organized by THU | IT, financial, idiom, place name, historical celebrity, poetry, medical, diet, legal, automobile, and animal thesauri | link
Crime Legal Terms and Classification Model | Contains 856 crime knowledge graphs, crime prediction based on a training database of 2.8 million cases, 13 types of question classification, and a legal information question answering function based on 200,000 legal question-answer pairs | github
Word segmentation corpus + code | | Baidu network disk link (extraction code: pea6)
Chinese word segmentation + part-of-speech tagging based on Bi-LSTM + CRF | Keras implementation | link
Chinese word segmentation and part-of-speech tagging based on Universal Transformer + CRF | | link
Fast Neural Network Word Segmentation Package | | java version
chinese-xinhua | Chinese Xinhua Dictionary database and API, including commonly used xiehouyu (two-part allegorical sayings), idioms, words, and Chinese characters | github
SpaCy Chinese model | Contains a parser, NER, syntax tree, and other functions. Some English packages use spaCy's English model; to adapt them to Chinese, you may need spaCy's Chinese model | github
Chinese character data | | github
Synonyms | Chinese synonym toolkit | github
HarvestText | Domain-adaptive text mining tools (new word discovery, sentiment analysis, entity linking, etc.) | github
word2word | Easy-to-use multilingual word-to-word pair sets: 62 languages / 3,564 multilingual pairs | github
Polyphone dictionary data and codes | | github
Chinese characters, words, idioms query interface | | github
103,976 English vocabulary packs | SQL version, CSV version, Excel version | github
Big list of swear words in English | | github
Word pinyin data | | github
Library of number names in 186 languages | | github
Large-scale name database of countries around the world | | github
Chinese character feature extractor (featurizer) | Extracts features of Chinese characters (pronunciation features, glyph features) for use as deep learning features | github
char_featurizer - Chinese character feature extraction tool | | github

Python interface library of MeCab, the CJK word segmentation library | | github
g2pC | Context-based automatic annotation module for Chinese pronunciation | github
ssc, Sound Shape Code | Sound shape code: a Chinese character string similarity calculation method based on the "sound shape code" | version 1; version 2; blog/introduction
Acquisition of multiple senses of Chinese words and word sense disambiguation of specific sentences, based on the encyclopedia knowledge base | | github
Tokenizer, a fast and customizable text tokenization library | | github
Tokenizers | State-of-the-art tokenizer with a focus on performance and versatility | github
Text "face changing" through synonym replacement | | github
token2index, a powerful lightweight token indexing library compatible with PyTorch/TensorFlow | | github
Traditional and Simplified Chinese Conversion | | github
Cantonese NLP Tools | | github
Domain dictionary | Professional dictionary knowledge base covering 68 fields with a total of 9.16 million words | github
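
Several tools in this section segment text into words; wordninja, for example, is described as splitting English strings written without spaces. The sketch below illustrates the general idea with dynamic programming over word costs. It is not wordninja's code or API: the vocabulary, the frequencies, and the function names are all made up for illustration.

```python
import math

# Toy vocabulary with made-up frequencies (an assumption for illustration).
WORD_FREQ = {"natural": 50, "language": 40, "processing": 30,
             "process": 20, "in": 100, "g": 1}
TOTAL = sum(WORD_FREQ.values())
MAX_LEN = max(len(word) for word in WORD_FREQ)


def word_cost(word):
    """Negative log probability; unknown words are infinitely costly."""
    freq = WORD_FREQ.get(word)
    return -math.log(freq / TOTAL) if freq else math.inf


def segment(text):
    """Split a space-free string into the lowest-cost sequence of known words."""
    n = len(text)
    best = [0.0] + [math.inf] * n  # best[i]: minimal cost of text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last word
    for i in range(1, n + 1):
        for j in range(max(0, i - MAX_LEN), i):
            cost = best[j] + word_cost(text[j:i])
            if cost < best[i]:
                best[i], back[i] = cost, j
    if math.isinf(best[n]):
        return None  # the vocabulary cannot cover the whole string
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


print(segment("naturallanguageprocessing"))
```

Because frequent words have lower cost, the single word "processing" is preferred over the split "process" + "in" + "g". A practical tool would derive its costs from a large corpus rather than a hand-written table.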

Pre-trained Language Model & Large Model
Resource name (Name) | Description | Link
BMList | Big Model Big List | github
Chinese translation of the BERT paper | | link
Slides by the original author of BERT | | link
Text Classification Practice | | github
BERT tutorial: text classification tutorial | | github
BERT PyTorch implementation | | github
BERT PyTorch implementation | | github
BERT for generating sentence vectors, text classification, and text similarity calculation | | github
Diagram of BERT and ELMo | | github
BERT pre-trained models and downstream applications | | github
Language/Knowledge Representation Tool BERT & ERNIE | | github
Using the GPT-2 language model in Kashgari | | github
Facebook LAMA | Probes for analyzing the factual and commonsense knowledge contained in pretrained language models. Language model analysis, providing a unified access interface for the Transformer-XL, BERT, ELMo, and GPT pre-trained language models | github
Chinese GPT2 training code | | github
XLM | Facebook's cross-lingual pre-trained language model | github
Massive Chinese pre-trained ALBERT model | | github
Transformers 2.0 | Natural language processing pre-trained language models (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet...) supporting TensorFlow 2.0 and PyTorch: 8 architectures / 33 pre-trained models / 102 languages | github

8 papers reviewing the progress of and reflections on BERT-related models | | github
French RoBERTa pre-trained language model | French RoBERTa language model pre-trained on a 138 GB corpus | link
Chinese pre-trained ELECTRA model | Pre-trained Chinese model based on adversarial learning | github
albert-chinese-ner | Uses the pre-trained language model ALBERT for Chinese NER | github
Open source pre-trained language model collection | | github
Chinese ELECTRA pre-training model | | github
Predicting the next word with Transformers (BERT, XLNet, Bart, Electra, Roberta, XLM-Roberta) (model comparison) | | github
TensorFlow Hub | New language models for 40+ languages (including Chinese) | link
UER | Chinese pre-trained model repository based on different corpora, encoders, and target tasks (including BERT, GPT, ELMo, etc.) | github
Open source pre-trained language model collection | | github
Multilingual sentence vector package | | github
Language Model as a Service (LMaaS) | Language Model as a Service | github
Open source language model GPT-NeoX-20B | 20 billion parameters; at the time of listing, the largest publicly accessible pre-trained general autoregressive language model | github
Chinese Science Literature Dataset (CSL) | Contains the metadata (titles, abstracts, keywords, disciplines, categories) of 396,209 papers in Chinese core journals. The CSL dataset can be used as a pre-training corpus and to construct many NLP tasks, such as text summarization (title prediction), keyword generation, and text classification | github
Large model development artifact | | github

Extraction
Resource name (Name) | Description | Link
Time extraction | Integrated into the Python package cocoNLP; welcome to try it | java version; python version
Neural network relation extraction (PyTorch) | Chinese is not supported yet | github
BERT-based named entity recognition (PyTorch) | Chinese is not supported yet | github
Keyword (keyphrase) extraction package pke | | github
BLINK, a state-of-the-art entity linking library | | github
Named entity recognition implemented with BERT/CRF | | github
LatticeLSTM Chinese named entity recognition with batch parallelism support | | github
Building a Model for Medical Entity Recognition | Contains dictionaries and corpus annotations; based on Python | github
Pipeline Entity and Relation Extraction Based on TensorFlow and BERT | Pipeline-style entity and relation extraction based on TensorFlow and BERT; a solution to the information extraction task of the 2019 Language and Intelligence Technology Competition (Schema based Knowledge Extraction, SKE 2019) | github

Chinese named entity recognition: NeuroNER vs BertNER | | github
Chinese Named Entity Recognition Based on BERT | | github
Chinese key phrase extraction tool | | github
BERT TensorFlow version for Chinese named entity recognition | | github
bert-Kashgari | Kashgari, a Keras-based wrapper framework for classification and sequence labeling that can build a classification or sequence labeling model in a few minutes | github
cocoNLP | Extracts information such as names, addresses, email addresses, mobile phone numbers, and mobile phone number attribution; includes the RAKE phrase extraction algorithm | github
Microsoft multilingual recognition package for numbers, units, and date/time | | github
Baidu open source benchmark information extraction system | | github
Chinese address word segmentation (identification and extraction of address elements), NER through sequence labeling | | github
Open-domain text knowledge triple extraction and knowledge base construction based on dependency syntax | | github
Chinese keyword extraction method based on pre-trained models | | github
chinese_keyphrase_extractor (CKPE) | A tool for quickly extracting and identifying keyphrases from natural language text | github
Simple resume parser to extract key information from resumes | | github
BERT-NER-Pytorch | Chinese NER experiments with three different modes of BERT | github
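
Some entries above, such as cocoNLP, extract structured fields like email addresses and mobile phone numbers from free text. A rough sketch of how such extraction can be done with regular expressions follows. It does not reproduce cocoNLP's implementation or API: the patterns are simplified assumptions and the function name `extract_contacts` is hypothetical.

```python
import re

# Simplified patterns (assumptions): real addresses and numbers vary more.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Mainland China mobile numbers: 11 digits, first digit 1, second digit 3-9.
MOBILE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")


def extract_contacts(text):
    """Return every email address and mobile number found in text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "mobiles": MOBILE_RE.findall(text),
    }


info = extract_contacts(
    "Contact Zhang San at zhangsan@example.com or 13812345678."
)
print(info)
```

The lookbehind and lookahead around the mobile pattern keep it from matching 11 digits inside a longer digit string, such as an order number.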
Knowledge Graph
Resource name (Name) | Description | Link
Tsinghua University XLORE Chinese-English cross-lingual encyclopedia knowledge graph | Baidu, Chinese Wiki, English Wiki | link
Automatic generation of document graphs | | github
Question answering system based on a knowledge graph in the medical field | | github (this repo refers to github)
Chinese person relationship knowledge graph project | | github
AmpliGraph knowledge graph representation learning (Python) library: knowledge graph concept link prediction | | github
Chinese knowledge graph materials, data and tools | | github
Chinese Knowledge Graph Based on Baidu Encyclopedia | Extracts triple information and builds a Chinese knowledge graph | github
Zincbase Knowledge Graph Construction Toolkit | | github
Question answering system based on a knowledge graph | | github
Collection of deep learning materials related to knowledge graphs | | github
Southeast University "Knowledge Graph" graduate course (data) | | github
Knowledge graph car audio work project | | github
"One Piece" Knowledge Graph | | github

A dataset of 132 knowledge graphs | Covers common sense, cities, finance, agriculture, geography, weather, social networking, Internet of Things, medical care, entertainment, daily life, business, travel, and science and education | link
Large-scale, structured, Chinese-English bilingual COVID-19 Knowledge Graph (COKG-19) | | link
Event triple extraction based on dependency syntax and semantic role labeling | | github
Abstract Knowledge Graph | Currently 500,000 entries in scale; supports the abstraction of nominal entities, state descriptions, and event actions | github
Large-scale Chinese knowledge graph data: 1.4 billion entities | | github
Jiagu natural language processing tool | Based on models such as BiLSTM; provides knowledge graph relation extraction, Chinese word segmentation, part-of-speech tagging, named entity recognition, sentiment analysis, new word discovery, keyword extraction, text summarization, text clustering, and other functions | github
medical_NER - Chinese medical knowledge graph named entity recognition | | github
A large list of learning materials, datasets, and tool resources related to knowledge graphs | | github
LibKGE, a knowledge graph embedding library for reproducible research | | github
Military field knowledge graph question answering project based on MongoDB storage | A military weapons knowledge base with 8 categories (aircraft, space equipment, etc.), more than 100 subcategories, and 5,800 items in total. The project does not use a graph database for storage; it uses jieba to parse questions, identifies the entities in them, and answers multiple types of questions through query templates. It mainly serves as a demo of how question answering is approached in industry | github

JD (Jingdong) Commodity Knowledge Graph | | github
Chinese Relation Extraction Based on Distant Supervision | | github
Intelligent Question Answering System Based on Medical Knowledge Graph | | github
BLINK, a state-of-the-art entity linking library | | github
A small securities knowledge graph/knowledge base | | github
dstlr, a scalable platform for building knowledge graphs from unstructured text | | github
Baidu Encyclopedia person entry attribute extraction | Uses BERT-based fine-tuning and feature extraction methods for knowledge graphs | github
Data related to COVID-19 | Chinese medical dialogue dataset on COVID-19 and other types of pneumonia; open data sources (COVID-19) from institutions such as Tsinghua University | github; github
DGL-KE graph embedding representation learning algorithm | | github
Causality graph | | method; data
Causal event pairs based on multi-domain text datasets | | link
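
The military question answering project listed above is described as identifying the entity in a question and then answering through query templates, without a graph database. A toy sketch of that template-based approach follows. Every name and record in it is hypothetical, and plain substring matching stands in for the jieba-based question analysis mentioned in the description.

```python
# Toy knowledge base: entity -> attribute -> value (all entries hypothetical).
KB = {
    "Aircraft A": {"category": "fighter aircraft", "country": "Country X"},
    "Ship B": {"category": "destroyer", "country": "Country Y"},
}

# Query templates: a trigger phrase in the question selects the attribute.
TEMPLATES = {"what type": "category", "which country": "country"}


def answer(question):
    """Answer a question by entity lookup plus template matching."""
    q = question.lower()
    # Substring matching replaces real word segmentation in this sketch.
    entity = next((e for e in KB if e.lower() in q), None)
    attribute = next((a for k, a in TEMPLATES.items() if k in q), None)
    if entity is None or attribute is None:
        return None  # no known entity or no matching template
    return KB[entity].get(attribute)


print(answer("Which country built Ship B?"))
print(answer("What type is Aircraft A?"))
```

Each new question type needs only one more template entry, which is the main appeal of this design for a small, fixed schema.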

Text Generation
Resource name (Name) | Description | Link
Texar | Toolkit for Text Generation and Beyond | github
Prof. Ehud Reiter's Blog | Strongly recommended by Professor Wan Xiaojun of Peking University; the blog offers in-depth discussion of and reflection on NLG technology, evaluation, and application | link
Large list of resources related to text generation | | github
Open Domain Dialogue Generation and Its Practice in Microsoft Xiaoice | Natural language generation gives machines the ability to create automatically | link
Text Generation Control | | github
A large list of natural language generation related resources | | github
Evaluating Natural Language Generation with BLEURT | | link
Automatic couplet data and robots | | code link; 700,000 couplet data
Automatically generate comments | Generates comments from Hacker News article titles using a Transformer encoder-decoder model | github

Natural language to SQL statement generation (English) | | github
Natural Language Generation Resource Collection | | github
Benchmarking Chinese Generation Tasks | | github
Topic-specific text generation/text augmentation based on GPT2 | | github
Encoding, Tokenization, and Implementation of a Controlled and Efficient Text Generation Methodology | | github
TextFooler, an adversarial text generation module for text classification/inference | | github
SimBERT | BERT model based on the UniLM idea, integrating retrieval and generation | github
New word generation and sentence making | Uses GPT-2 variants to generate non-existent new words from scratch, along with their definitions and example sentences | github
Automatically generate multiple choice questions from text | | github
Synthetic Data Generation Benchmark | | github

Text Summarization
Resource name (Name) | Description | Link
Chinese text summarization/keyword extraction | | github
Automatic Summarization of Resumes Based on Named Entity Recognition | | github
Automatic text summarization library TextTeaser | English only | github
Extractive summarization based on recent language models such as BERT | | github
A Comprehensive Guide to Text Summarization with Deep Learning in Python | | link
(Colab) Abstractive Text Summarization Implementation Highlights (Tutorial) | | github

Intelligent Question Answering
Resource name (Name) | Description | Link
Chinese chatbot | Train the chatbot you want on your own corpus; usable in scenarios such as intelligent customer service, online question answering, and intelligent chat | github
Interesting robot qingyun | Chinese chatbot trained with qingyun | github
Open dialogue robots, knowledge graphs, semantic understanding, natural language processing tools and data | | github
QA pair robot | Amodel-for-Retrivalchatbot: customer service robot, Chinese retrieval chatbot | git
ConvLab, an open source multi-domain end-to-end dialogue system platform | | github
A dialogue system based on the latest version of Rasa | | github
Chatbots for the financial and judicial domains (with small talk capability) | | github
End-to-end closed-domain dialogue system | | github
MiningZhiDaoQACorpus | Baidu Zhidao Q&A data mining project: a Q&A corpus of more than 5.8 million questions, each with a question label. The corpus can support a variety of applications, such as logic mining | github
GPT2 model for Chinese chitchat (GPT2-chitchat) | | github
Resource lists (leaderboards, datasets, papers) on multi-turn response selection for retrieval-based chatbots | | github

Microsoft Conversational Bot Framework | | github
chatbot-list | Applications and architectures of intelligent customer service and chatbots in industry, with algorithm sharing and introductions | github
Chinese medical dialogue data | Chinese medical dialogue dataset | github
A Large-Scale Medical Dialogue Dataset | Contains 1.1 million medical consultations and 4 million doctor-patient dialogues | github
Large-scale cross-domain Chinese task-oriented multi-turn dialogue dataset and model CrossWOZ | | paper & data
Open source conversational information search platform | | github
Contextual Interaction Multimodal Dialogue Challenge 2020 (DSTC9 2020) | | github
T5 question paraphrasing (Paraphrase) trained on Quora questions | | github
Google releases Taskmaster-2 natural language task dialogue dataset | | github
Haystack, a flexible, powerful, and extensible Question Answering (QA) framework | | github
End-to-end closed-domain dialogue system | | github
Amazon releases knowledge-based human-human open domain dialogue dataset | | github
Albert Large QA model trained on the Baidu WebQA and DuReader datasets | | github
CommonsenseQA: Commonsense-Oriented English QA Challenge | | link
MedQuAD (English) Medical Question Answering Dataset | | github
A Q&A engine using Wikipedia text as context, based on Albert and Electra | | github
A question answering attempt based on a knowledge base of 140,000 songs | Functions include lyrics solitaire, finding songs from known lyrics, and question answering over the song-artist-lyrics triangular relationship | github

Text Error Correction
Resource name (Name) | Description | Link
Chinese text error correction module code | | github
English spell checking library | | github
Python spell checking library | | github
GitHub Typo Corpus | Large-scale GitHub multilingual spelling/grammar error dataset | github
BertPunc | BERT-based state-of-the-art punctuation restoration model | github
Chinese writing proofreading tool | | github
Text Error Correction Literature List | Chinese Spell Checking (CSC) and Grammatical Error Correction (GEC) | github
Winner of the Text Smart Proofreading Contest | Already deployed in applications; from the team of Soochow University and DAMO Academy | link
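
The spell checking libraries listed above are not described in detail here. As a rough illustration of one classic approach (generate every candidate within one edit of the input and keep the most frequent known word), a minimal sketch follows. It is not the code of any listed library, and the word counts are made up.

```python
import string

# Toy word counts (assumption); a real checker counts words in a large corpus.
WORD_COUNTS = {"spelling": 10, "spell": 5, "checker": 7, "correction": 8}


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word):
    """Return word if known, else the most frequent known word one edit away."""
    if word in WORD_COUNTS:
        return word
    candidates = [w for w in edits1(word) if w in WORD_COUNTS]
    return max(candidates, key=WORD_COUNTS.get) if candidates else word


print(correct("speling"), correct("chekcer"))
```

This sketch only handles lowercase ASCII words and single edits; Chinese spelling correction, which several entries above target, usually relies on confusion sets of similar-sounding or similar-looking characters instead.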
Multimodal
Resource name (Name) | Description | Link
Chinese Multimodal Dataset "Wukong" | Large-scale dataset open-sourced by Huawei's Noah's Ark Laboratory, including 100 million image-text pairs | github
Chinese image-text representation pre-training model Chinese-CLIP | The Chinese version of the CLIP model, open-sourced at multiple model scales; a few lines of code can handle Chinese image-text representation extraction and image-text retrieval | github

Speech Processing
Resource name (Name) | Description | Link
ASR Speech Dataset + Chinese Speech Recognition System Based on Deep Learning | | github
Tsinghua University THCHS30 Chinese Speech Dataset | | data_thchs30.tgz (OpenSLR domestic mirror); data_thchs30.tgz; test-noise.tgz (OpenSLR domestic mirror); test-noise.tgz; resource.tgz (OpenSLR domestic mirror); resource.tgz; Free ST Chinese Mandarin Corpus; Free ST Chinese Mandarin Corpus; AIShell-1 open source version dataset (OpenSLR domestic mirror); AIShell-1 open source version dataset; Primewords Chinese Corpus Set 1 (OpenSLR domestic mirror); Primewords Chinese Corpus Set 1

Laughter detector | | github
Common Voice Speech Recognition Dataset (New Version) | Includes over 1,400 hours of speech samples from 42,000 contributors, covering | link; github
speech-aligner | A tool for generating phoneme-level time-aligned annotations from human speech and its corresponding text | github
ASR Speech Dictionary/Lexicon | | github
Speech Sentiment Analysis | | github
masr | Chinese speech recognition, providing pre-trained models with a high recognition rate | github
Chinese Text Normalization for Speech Recognition | | github
Voice quality evaluation metrics (MOSNet, BSSEval, STOI, PESQ, SRMR) | | github
Chinese/English Pronunciation Dictionary for Speech Recognition | | github
CoVoST, a multilingual speech-to-text translation corpus released by Facebook | Includes audio, text transcriptions, and English translations for 11 languages (French, German, Dutch, Russian, Spanish, Italian, Turkish, Persian, Swedish, Mongolian, and Chinese) | github

Parakeet, text-to-speech synthesis based on PaddlePaddle | | github
(Java) Accurate Speech Natural Language Detection Library | | github
CoVoST, a multilingual speech-to-text translation corpus released by Facebook | | github
Text-to-Speech Synthesis Implemented in TensorFlow 2 | | github
Python audio feature extraction package | | github
ViSQOL | Objective, full-reference metric for perceived audio quality, with two modes: audio and speech | github
zhrtvc | Easy-to-use Chinese voice cloning and Chinese speech synthesis system | github
aukit | An easy-to-use speech processing toolbox, including speech noise reduction, audio format conversion, spectral feature generation, and other modules | github
phkit | An easy-to-use phoneme processing toolbox, including Chinese phonemes, English phonemes, text-to-pinyin, text normalization, and other modules | github
zhvoice | Chinese speech corpus with clearer, more natural speech, including 8 open source datasets, 3,200 speakers, 900 hours of speech, and 13 million words | github
audio | Audio annotation tools for speech activity detection, binarization, speaker recognition, automatic speech recognition, emotion recognition, and other tasks | github
Deep Learning Emotional Text-to-Speech Synthesis | | github
Python audio data augmentation library | | github
Audio Enhancement Based on the Large-Scale Audio Dataset AudioSet | | github
Voice transfer | | github

Document Processing
Resource name (Name) | Description | Link
LayoutLM-v3 Document Understanding Model | | github
PyLaia, a deep learning toolkit for handwritten document analysis | | github
Single-document unsupervised keyword extraction | | github
DocSearch, a free documentation search engine | | github
fdfgen | Automatically creates PDF documents and fills in information | link
pdfx | Automatically extracts cited references and downloads the corresponding PDF files | link
invoice2data | Invoice PDF information extraction | invoice2data
PDF document information extraction | | github
PDFMiner | PDFMiner can get the exact position of text on a page, as well as other information such as fonts or lines. It also has a PDF converter that can convert PDF files to other text formats such as HTML, and an extensible PDF parser that can be used for purposes other than text analysis | link

PyPDF2 | PyPDF2 is a Python PDF library capable of splitting, merging, cropping, and converting pages of PDF files. It can also add custom data, viewing options, and passwords to PDF files. It can retrieve text and metadata from PDFs, and can also merge entire files together | link
ReportLab | ReportLab can quickly create PDF documents. It is a time-proven, easy-to-use open source project for creating complex, data-driven PDF documents and custom vector graphics. It is free, open source, and written in Python. With more than 50,000 downloads per month, the package is part of standard Linux distributions, embedded in many products, and was chosen to power Wikipedia's print/export functionality | link
SIMPdf, a simple PDF file text editor written in Python | | github
pdf-diff | PDF diff tool that displays the differences between two PDF documents | github

table processing
Resource name (Name) Description Link

Use U-Net to realize automatic detection of document tables and table reconstruction github

pdftabextract Used for table information analysis after OCR recognition, very powerful link

tabula-py Directly converts the table information in a PDF to a pandas DataFrame; code is available in both Java and Python versions

camelot PDF table parsing link

pdfplumber PDF table parsing

PubLayNet Able to divide paragraphs and identify tables and pictures link

Extract tabular data from papers github

Finding answers in tables with BERT github

Series of articles on table question answering Introduction to the end of the model

Generate tabular data using GAN (English only) github

carefree-learn (PyTorch) Automated Machine Learning (AutoML) package for tabular datasets github

Closed domain fine-tuning table detection github

PDF table data extraction tool github

TaBERT A new model for understanding tabular data queries paper

Table processing Awesome-Table-Recognition github

text matching
Resource name (Name) Description Link

Sentence and QA similarity matching: MatchZoo A collection of text similarity matching algorithms, including multiple deep learning methods, worth trying. github

Chinese Question Sentence Similarity Calculation Competition and Scheme Summary github

similarity: similarity calculation toolkit Written in Java; used for similarity calculations on words, phrases and sentences, lexical analysis, sentiment analysis, semantic analysis, etc. github

Chinese word similarity calculation method Combines the word similarity calculation methods of the Synonyms Cilin Extended Edition and HowNet, giving broader vocabulary coverage and more accurate results. github

Python string similarity algorithm library github

Similar sentence judgment model based on a Siamese BiLSTM model, providing a training data set and a test data set 100,000 training samples provided github
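To give a feel for what the string similarity libraries in this table compute, here is a minimal sketch of Levenshtein edit distance with a normalized similarity score. It is an illustration only: the function names `levenshtein` and `similarity` are made up for this example and are not the API of any package listed above.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


print(levenshtein("kitten", "sitting"))
print(round(similarity("今天天气好", "今天天气不错"), 3))
```

Because the functions work on characters, they apply to Chinese text without word segmentation. For sentence-level semantic matching, the learned models in the table (MatchZoo, Siamese BiLSTM) are the better fit.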

Text Data Augmentation


Resource name (Name) Description Link

Chinese NLP Data Augmentation (EDA) Tool github

English NLP data augmentation tool github

One-click Chinese data augmentation tool github

The application and effect of data augmentation in machine translation and other NLP tasks link

NLP Data Augmentation Resource Collection github
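EDA (Easy Data Augmentation) is usually described as four token-level operations: synonym replacement, random insertion, random swap and random deletion. The sketch below shows only the two operations that need no synonym dictionary. It is a rough illustration, not the code of the tools listed above. The function names and the pre-segmented example sentence are assumptions, and Chinese text would first need a word segmenter such as jieba.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    tokens = list(tokens)
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)  # fixed seed so the augmentation is reproducible
sentence = ["今天", "天气", "非常", "好"]  # assumed to be already segmented
print(random_swap(sentence, 1, rng))
print(random_deletion(sentence, 0.3, rng))
```

Each call yields one augmented variant. In practice several variants per sentence are generated and added to the training set with the original label.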


Common regular expressions
Resource name (Name) Description Link

Regular expression to extract email It has been integrated into the Python package cocoNLP, welcome to try

Extract phone_number It has been integrated into the Python package cocoNLP, welcome to try

Regular expression for extracting ID number IDCards_pattern = r'^([1-9]\d{5}[12]\d{3}(0[1-9]|1[012])(0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX])$'
IDs = re.findall(IDCards_pattern, text, flags=0)

IP address regular expression (25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)\.(25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)

Tencent QQ number regular expression [1-9]([0-9]{5,11})

Domestic fixed-line number regular expression [0-9-()（）]{7,18}

username regex [A-Za-z0-9_\-\u4e00-\u9fa5]+

Regular matching of domestic phone numbers (three major operators + virtual, etc.) github

Regular Expression Tutorial github
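The patterns in this table can be tried directly with Python's `re` module. The sketch below adapts three of them. The ID pattern drops the `^`/`$` anchors and uses non-capturing groups so that `findall` returns whole matches from running text, and the dots in the IP pattern are escaped so that they match only a literal dot. The helper names and the sample strings are assumptions made for this example. These patterns check format only; for instance, the ID check digit is not verified.

```python
import re

# ID number: region code, birth date (year, month, day), sequence, check character.
# The lookarounds stop the pattern from matching inside a longer run of digits.
ID_CARD = re.compile(
    r'(?<!\d)[1-9]\d{5}[12]\d{3}(?:0[1-9]|1[012])'
    r'(?:0[1-9]|[12][0-9]|3[01])\d{3}[0-9xX](?!\d)'
)

# One IPv4 octet in the range 0-255, then four of them joined by literal dots.
OCTET = r'(?:25[0-5]|2[0-4]\d|[0-1]\d{2}|[1-9]?\d)'
IPV4 = re.compile(rf'{OCTET}(?:\.{OCTET}){{3}}')

# QQ number: 6 to 12 digits, not starting with 0.
QQ = re.compile(r'[1-9][0-9]{5,11}')


def extract_id_numbers(text):
    """Return every substring of text that looks like an 18-character ID number."""
    return ID_CARD.findall(text)


def is_ipv4(candidate):
    """True if the whole string is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(candidate) is not None


def is_qq_number(candidate):
    """True if the whole string has the shape of a QQ number."""
    return QQ.fullmatch(candidate) is not None


print(extract_id_numbers("ID 11010519491231002X on file"))
print(is_ipv4("192.168.0.1"), is_ipv4("256.1.1.1"))
print(is_qq_number("123456"), is_qq_number("012345"))
```

`fullmatch` is used for validation so that trailing characters are rejected, while `findall` is used for extraction from longer text.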

text search
Resource name (Name) Description Link

Efficient Fuzzy Search Tool github

Large list/search engine of BERT models for various languages/tasks link

Deepmatch: deep matching model library for recommendation, advertising and search github

wwsearch is a full-text search engine developed by the WeChat Work (Enterprise WeChat) backend team github

aili - the fastest in-memory index in the East github

Efficient string matching tool RapidFuzz A fast string matching library for Python and C++ that uses the string similarity calculations from FuzzyWuzzy github
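Fuzzy search tools like the ones above rank candidates by string similarity to a possibly misspelled query. As a rough standard-library illustration of the idea (not RapidFuzz's API or algorithm), `difflib.get_close_matches` can do the same on a small list. The `fuzzy_search` wrapper and the candidate list are assumptions made for this example.

```python
from difflib import get_close_matches


def fuzzy_search(query, choices, limit=3, cutoff=0.6):
    """Return up to `limit` choices whose similarity to query is at least `cutoff`."""
    return get_close_matches(query, choices, n=limit, cutoff=cutoff)


tools = ["jieba", "hanlp", "stanza", "spacy"]
print(fuzzy_search("jeiba", tools))   # query with a transposed pair of letters
print(fuzzy_search("zzzzz", tools))   # nothing close enough
```

`difflib` is convenient but slow on large vocabularies, which is the gap that dedicated libraries such as RapidFuzz aim to fill.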

reading comprehension
Resource name (Name) Description Link

Efficient Fuzzy Search Tool github

Large list/search engine of BERT models for various languages/tasks link

Deepmatch: deep matching model library for recommendation, advertising and search github

allennlp reading comprehension supports a variety of data and models github

sentiment analysis
Resource name (Name) Description Link

aspect sentiment analysis package github

awesome-nlp-sentiment-analysis Sentiment analysis, emotional cause identification, evaluation object and evaluation word extraction github

Sentiment analysis technology enables intelligent customer service to better understand human emotions github

event extraction
Resource name (Name) Description Link

Chinese event extraction github

List of Literature Resources for NLP Event Extraction github

BERT event extraction implemented in PyTorch (ACE 2005 corpus) github

News Event Clue Extraction github

machine translation
Resource name (Name) Description Link

Wudao dictionary The command line version of Youdao Dictionary; supports English-Chinese lookup in both directions and online search github

NLLB Language model NLLB that supports translation between any pair of 200+ languages link

Easy-Translate Script to translate large text files locally, based on Facebook/Meta AI's M2M100 and NLLB200 models; supports 200+ languages github

number conversion
Resource name (Name) Description Link

The best conversion tool between Chinese character numbers (Chinese numerals) and Arabic numbers github

Quickly convert between "Chinese numerals" and "Arabic numerals" github

Parse and convert natural language numeric strings to integers and floating point numbers github
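The core of Chinese numeral conversion is a small accumulator. Digits wait for a unit (十, 百, 千), units add into the current section, and the section markers 万 and 亿 close a section. The following is a minimal sketch of that idea for well-formed integers only. It is not the implementation of the tools listed above, and `chinese_to_int` is a hypothetical name. It does not handle decimals, negative numbers, or colloquial shortenings such as 三万五 (meaning 35,000).

```python
DIGITS = {"零": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}


def chinese_to_int(text: str) -> int:
    """Convert a well-formed Chinese integer numeral such as 两千零二十三 to an int."""
    total = 0    # value of the sections already closed by 万 or 亿
    section = 0  # value accumulated inside the current section (below 10,000)
    digit = 0    # most recent digit, still waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit counts as one of that unit, e.g. 十五 is 15.
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10**4
            section = digit = 0
        elif ch == "亿":
            # 亿 scales everything read so far, including earlier 万 sections.
            total = (total + section + digit) * 10**8
            section = digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


for numeral in ["十五", "一百零五", "两千零二十三", "三万五千", "一亿二千万"]:
    print(numeral, chinese_to_int(numeral))
```

Raising `ValueError` on unknown characters keeps the sketch from silently returning a wrong number for inputs it was not designed for.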

anaphora resolution
Resource name (Name) Description Link

Chinese coreference resolution data github, Baidu link (extraction code a0qq)

text clustering
Resource name (Name) Description Link

TextCluster: short text clustering preprocessing module (Short text cluster) github

Text Categorization
Resource name (Name) Description Link

NeuralNLP-NeuralClassifier Tencent open source deep learning text classification tool github

knowledge reasoning
Resource name (Name) Description Link

Graphbrain: AI open source software library and research tool designed to facilitate automatic meaning extraction and text understanding as well as knowledge exploration and inference github

(Harvard) free book on causal reasoning pdf

Interpretable Natural Language Processing
Resource name (Name) Description Link

State-of-the-art interpreter library for textual machine learning models github

text attack
Resource name (Name) Description Link

TextAttack: adversarial attack framework for natural language processing models github

OpenBackdoor: Text backdoor attack and defense toolkit OpenBackdoor is developed based on Python and PyTorch, and can be used to reproduce, evaluate and develop algorithms for text backdoor attack and defense github

text visualization
Resource name (Name) Description Link

Scattertext text visualization (Python) github

whatlies word vector interactive visualization spacytools

PySS3 visualization tool for SS3 text classifiers, aimed at explainable AI github

Render 3D images with Notepad github

attnvis: interactive attention visualization for GPT2, BERT and other transformer language models github

Texthero text data efficient processing package Including preprocessing, keyword extraction, named entity recognition, vector space analysis, text visualization, etc. github

text annotation tool


Resource name (Name) Description Link

Overview of NLP annotation platforms github

brat rapid annotation tool: sequence annotation tool link

Poplar: web-based natural language annotation tool github

LIDA is a lightweight interactive dialogue annotation tool github

doccano is a web-based open source collaborative multilingual text annotation tool github

Datasaur.ai online data labeling workflow management tool link

language detection
Resource name (Name) Description Link

langid 97 languages detected https://github.com/saffsd/langid.py

langdetect language detection https://code.google.com/archive/p/language-detection/

comprehensive tool
Resource name (Name) Description Link

jieba jieba

hanlp hanlp

nlp4han Chinese natural language processing tool set (sentence segmentation/word segmentation/part-of-speech tagging/chunking/syntactic analysis/semantic analysis/NER/N-gram/HMM/pronoun resolution/sentiment analysis/spelling check) github

Progress in Hate Speech Detection link

BERT applications based on PyTorch Including named entity recognition, sentiment analysis, text classification and text similarity, etc. github

Some basic models of natural language github

Template code for sequence tagging and text classification with BERT github

jieba_fast accelerated version of jieba github

StanfordNLP Pure Python version of natural language processing package link

Python Spoken Natural Language Processing Toolset (English) github

PreNLP natural language preprocessing library github

Some papers and code related to NLP Including topic models, word vectors (Word Embedding), named entity recognition (NER), text classification (Text Classification), text generation (Text Generation), text similarity (Text Similarity) calculation, etc., covering various NLP-related algorithms, based on Keras and TensorFlow github

Python text mining/NLP practical examples github

Forte: flexible and powerful natural language processing pipeline toolset github

stanza Stanford team NLP tools Can handle more than sixty languages github

Fancy-NLP is a text knowledge mining tool for building product portraits github

Comprehensive and easy-to-use Chinese NLP toolkit github

Reproduction of the vectorized recall pipelines commonly used in industry, based on DSSM github

Texthero text data efficient processing package Including preprocessing, keyword extraction, named entity recognition, vector space analysis, text visualization, etc. github

nlpgnn graph neural network natural language processing toolbox github

Macadam Based on TensorFlow (Keras) and bert4keras, a natural language processing toolkit focusing on text classification, sequence labeling and relation extraction github

LineFlow is an efficient NLP data loader for all deep learning frameworks github

Arabica: Python text data exploratory analysis toolkit github

Python stress testing tool: SMSBoom github
funny tool
Resource name (Name) Description Link

Wang Feng Lyric Generator phunterlau/wangfeng-rnn

Analysis of girlfriend's emotional fluctuations github

NLP is too difficult series github

Variable naming artifact github link

Image text removal, can be used for manga translation github

CoupletAI - couplet generation Automatic couplet system based on CNN+Bi-LSTM+Attention github

Solving Complex Mathematical Equations Using Neural Network Symbolic Reasoning github

Question answering robot based on a knowledge base of 140,000 songs Functions include a lyric chain game, finding songs from known lyrics, and question answering over the song-artist-lyrics triangle github

COPE - Metrical Poem Editor github

Paper2GUI An AI desktop app toolbox for ordinary people. It can be used immediately without installation. It already supports 18+ AI models, covering speech synthesis, video frame interpolation, video super-resolution, object detection, image stylization, OCR recognition, etc. github

Politeness estimator (trained using Sina Weibo data) github paper

Grass Python (Chinese version of Python) getting started guide Chinese programming language homepage gitee

course report interview


Resource name (Name) Description Link

Natural Language Processing Report link

Knowledge Graph Report link

Data Mining Report link

Autonomous Driving Report link

Machine Translation Report link

Blockchain Report link

Robot Report link

Computer Graphics Report link

3D Printing Report link

Facial Recognition Report link

Artificial Intelligence Chip Report link

cs224n deep learning natural language processing course (course link) PyTorch implementation of the models in the course: link

Natural Language Processing by Example Tutorial for Deep Learning Researchers github

"Natural Language Processing" by Jacob Eisenstein github

ML-NLP Machine learning (Machine Learning) knowledge points and code implementations often tested in NLP interviews github

NLP task example project code set github

2019 NLP Highlights Review download

nlp-recipes produced by Microsoft: best practices and examples of natural language processing github

Transfer Learning in Natural Language Processing (NLP) youtube

Machine Learning Systems book link github

Contest
Resource name (Name) Description Link

Review of the top solutions of all NLP competitions github

2019 Baidu Triple Extraction Competition, "Scientific Space Team" source code (7th place) github

Financial Natural Language Processing


Resource name (Name) Description Link

BDCI2019 Financial Negative Information Judgment github

Open source financial investment data extraction tool github

A large list of natural language processing research resources in the financial field github

Chatbots based on the financial-judicial domain (with the nature of small talk) github

Demonstration of small-scale financial knowledge graph construction process github

Medical Natural Language Processing


Resource name (Name) Description Link

Chinese medical NLP public resources arrangement github

spaCy Medical Text Mining and Information Extraction github

Building a Model for Medical Entity Recognition Contains dictionaries and corpus annotations, based on Python github

Question answering system based on knowledge graph in medical field github (this repo refers to github)

Chinese medical dialogue data Chinese medical dialogue data set github

A Large-Scale Medical Dialogue Dataset Contains 1.1 million medical consultations and 4 million doctor-patient dialogues github

Data related to COVID-19 COVID-19 and other types of pneumonia Chinese medical dialogue dataset; open data sources of institutions such as Tsinghua University (COVID-19) github github

Legal Natural Language Processing


Resource name (Name) Description Link

Blackstone: spaCy pipeline and NLP model for unstructured legal text github

List of Legal Intelligence Literature Resources github

Chatbots based on the financial-judicial domain (with the nature of small talk) github

Crime Legal Terms and Classification Model Contains a knowledge graph of 856 crimes, crime prediction based on a training database of 2.8 million cases, 13 types of question classification, and a legal information question answering function based on 200,000 legal question and answer pairs github

text to image
Resource name (Name) Description Link

Dalle-mini A mini version of DALL·E that generates pictures based on text prompts github

other
Resource name (Name) Description Link

phone China mobile phone attribution query ls0f/phone

phone International mobile phone and telephone attribution inquiry AfterShip/phone

ngender gender based on name observerss/ngender

A summary of the differences between Chinese and English natural language processing (NLP) link

Technical documents (PDF or PPT) shared internally by experts at major companies github

comparxiv is used to compare the difference between two submitted versions on arXiv pypi

Meta-architecture of CHAMELEON deep learning news recommendation system github

Automatic Resume Screening System github

A variety of text readability evaluation indicators implemented in Python github

Data Science ML Full Stack Roadmap


https://github.com/hemansnation/Data-Science-ML-Full-Stack-2022

Join the Data Science & ML Full Stack WhatsApp Group Community here:
If the group is full, please join another one.

https://chat.whatsapp.com/B7Mdp6QTMJ0KZYGWrziT3Y
https://chat.whatsapp.com/HWDSJU4KXrXJIcn5Npp3Gm
https://chat.whatsapp.com/DmATV5uaVY7IKrTMHDiHnr
https://chat.whatsapp.com/Blz2n8QYSgdKWfQbJZxHtJ

Join Telegram for Data Science ML AI Resources:
https://t.me/+sREuRiFssMo4YWJl

Join Community on LinkedIn:
https://www.linkedin.com/groups/12540639/

Connect with me on these platforms:

LinkedIn: https://www.linkedin.com/in/hemansnation/
Twitter: https://twitter.com/hemansnation
GitHub: https://github.com/hemansnation
Instagram: https://www.instagram.com/masterdexter.ai/

Are you a professional?

DM for One-on-One sessions for Python, Data Science, Machine Learning, and Data Engineering.
Here: https://bit.ly/3U6zQvQ

Python Notion Template
https://hemansnation.gumroad.com/l/god-level-python-with-himanshu-ramchandani
