spaCy Cheat Sheet
Python For Data Science
Learn spaCy online at www.DataCamp.com

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It's designed specifically for production use and helps you build applications that process and "understand" large volumes of text. Documentation: spacy.io

$ pip install spacy

> Spans

Accessing spans
Span indices are exclusive. So doc[2:4] is a span starting at token 2, up to – but not including! – token 4.

>>> doc = nlp("This is a text")
>>> span = doc[2:4]
>>> span.text
'a text'

Creating a span manually

>>> from spacy.tokens import Span          # Import the Span object
>>> doc = nlp("I live in New York")        # Create a Doc object
>>> span = Span(doc, 3, 5, label="GPE")    # Span for "New York" with label GPE (geopolitical)
>>> span.text
'New York'
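A manually created span can also be attached to the document as an entity. A minimal sketch, assuming the doc and span from above and the writable doc.ents used by this version of spaCy:

>>> doc.ents = [span]                      # Overwrite the doc's entities with the manual span
>>> [(ent.text, ent.label_) for ent in doc.ents]
[('New York', 'GPE')]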
> Visualizing

If you're in a Jupyter notebook, use displacy.render. Otherwise, use displacy.serve to start a web server and show the visualization in your browser.

>>> from spacy import displacy

Visualize dependencies

>>> doc = nlp("This is a sentence")
>>> displacy.render(doc, style="dep")

Visualize named entities

>>> doc = nlp("Larry Page founded Google")
>>> displacy.render(doc, style="ent")
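Outside a notebook, the same visualization can be served in the browser instead of rendered inline. A minimal sketch using displacy.serve, which starts a local web server (on port 5000 by default):

>>> doc = nlp("This is a sentence")
>>> displacy.serve(doc, style="dep")       # Open http://localhost:5000 to view the visualization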
> Statistical models

Download statistical models
Predict part-of-speech tags, dependency labels, named entities and more. See here for available models: spacy.io/models

$ python -m spacy download en_core_web_sm

Check that your installed models are up to date

$ python -m spacy validate

Loading statistical models

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")     # Load the installed model "en_core_web_sm"

> Linguistic features

Attributes return label IDs. For string labels, use the attributes with an underscore. For example, token.pos_ .

Part-of-speech tags (predicted by statistical model)

>>> doc = nlp("This is a text.")
>>> [token.pos_ for token in doc]          # Coarse-grained part-of-speech tags
['DET', 'VERB', 'DET', 'NOUN', 'PUNCT']
>>> [token.tag_ for token in doc]          # Fine-grained part-of-speech tags
['DT', 'VBZ', 'DT', 'NN', '.']
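To illustrate the note above: attributes without an underscore return integer label IDs, while the underscored variants return string labels. A small sketch (the exact ID value depends on the model's string store):

>>> doc[0].pos                             # Integer ID of the coarse-grained tag (model-dependent value)
>>> doc[0].pos_                            # Human-readable string label
'DET'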
Syntactic dependencies (predicted by statistical model)

>>> doc = nlp("This is a text.")
>>> [token.dep_ for token in doc]          # Dependency labels
['nsubj', 'ROOT', 'det', 'attr', 'punct']
>>> [token.head.text for token in doc]     # Syntactic head token (governor)
['is', 'is', 'text', 'is', 'is']

Named entities (predicted by statistical model)

>>> doc = nlp("Larry Page founded Google")
>>> [(ent.text, ent.label_) for ent in doc.ents]   # Text and label of named entity span
[('Larry Page', 'PERSON'), ('Google', 'ORG')]
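Entity spans also carry character offsets into the original text, which is useful for highlighting. A minimal sketch using the standard start_char and end_char attributes on the doc from above:

>>> [(ent.text, ent.start_char, ent.end_char) for ent in doc.ents]
[('Larry Page', 0, 10), ('Google', 19, 25)]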
> Word vectors and similarity

To use word vectors, you need to install the larger models ending in md or lg, for example en_core_web_lg.

Comparing similarity

>>> doc1 = nlp("I like cats")
>>> doc2 = nlp("I like dogs")
>>> doc1.similarity(doc2)                  # Compare 2 documents
>>> doc1[2].similarity(doc2[2])            # Compare 2 tokens
>>> doc1[0].similarity(doc2[1:3])          # Compare tokens and spans

Accessing word vectors

>>> doc = nlp("I like cats")
>>> doc[2].vector                          # Vector as a numpy array
>>> doc[2].vector_norm                     # The L2 norm of the token's vector
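Not every token has a vector in every model, and a document's vector is the average of its tokens' vectors. A minimal sketch using the standard has_vector and vector attributes:

>>> doc = nlp("I like cats")
>>> doc[2].has_vector                      # True if the model provides a vector for this token
>>> doc.vector.shape                       # Document vector (average of token vectors); dimensionality depends on the model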
> Documents and tokens

Processing text
Processing text with the nlp object returns a Doc object that holds all information about the tokens, their linguistic features and their relationships.

>>> doc = nlp("This is a text")

Accessing token attributes

>>> doc = nlp("This is a text")
>>> [token.text for token in doc]          # Token texts
['This', 'is', 'a', 'text']
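Tokens expose many more lexical attributes than .text. A small sketch showing a few of the standard ones on a throwaway example:

>>> doc = nlp("It costs $5.")
>>> [token.is_alpha for token in doc]      # Consists of alphabetic characters?
[True, True, False, False, False]
>>> [token.like_num for token in doc]      # Resembles a number?
[False, False, False, True, False]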
> Syntax iterators

Sentences (usually needs the dependency parser)

>>> doc = nlp("This is a sentence. This is another one.")
>>> [sent.text for sent in doc.sents]      # doc.sents is a generator that yields sentence spans
['This is a sentence.', 'This is another one.']

Base noun phrases (needs the tagger and parser)

>>> doc = nlp("I have a red car")
>>> [chunk.text for chunk in doc.noun_chunks]   # doc.noun_chunks is a generator that yields spans
['I', 'a red car']

> Label explanations

>>> spacy.explain("RB")
'adverb'
>>> spacy.explain("GPE")
'Countries, cities, states'
> Pipeline components

Functions that take a Doc object, modify it and return it.

Pipeline information

>>> nlp = spacy.load("en_core_web_sm")
>>> nlp.pipe_names
['tagger', 'parser', 'ner']
>>> nlp.pipeline
[('tagger', <spacy.pipeline.Tagger>),
 ('parser', <spacy.pipeline.DependencyParser>),
 ('ner', <spacy.pipeline.EntityRecognizer>)]

Custom components

# Function that modifies the doc and returns it
def custom_component(doc):
    print("Do something to the doc here!")
    return doc

# Add the component first in the pipeline
nlp.add_pipe(custom_component, first=True)

Components can be added first, last (default), or before or after an existing component.
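For example, a sketch of placing the component relative to an existing one, using the same add_pipe call as above ("my_component" is an arbitrary name; "ner" is the built-in entity recognizer):

>>> nlp.add_pipe(custom_component, name="my_component", after="ner")   # Run the custom component right after the entity recognizer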
> Extension attributes

Custom attributes that are registered on the global Doc, Token and Span classes and become available as ._ .

>>> from spacy.tokens import Doc, Token, Span
>>> doc = nlp("The sky over New York is blue")

Attribute extensions (with default value)

# Register custom attribute on Token class
>>> Token.set_extension("is_color", default=False)

# Overwrite extension attribute with default value
>>> doc[6]._.is_color = True

Property extensions (with getter and setter)

# Register custom attribute on Doc class
>>> get_reversed = lambda doc: doc.text[::-1]
>>> Doc.set_extension("reversed", getter=get_reversed)

# Compute value of extension attribute with getter
>>> doc._.reversed
'eulb si kroY weN revo yks ehT'

Method extensions (callable method)

# Register custom attribute on Span class
>>> has_label = lambda span, label: span.label_ == label
>>> Span.set_extension("has_label", method=has_label)

# Compute value of extension attribute with method
>>> doc[3:5]._.has_label("GPE")
True

> Rule-based matching

Using the matcher

# Matcher is initialized with the shared vocab
>>> from spacy.matcher import Matcher
>>> matcher = Matcher(nlp.vocab)

# Each dict represents one token and its attributes
>>> pattern = [{"LOWER": "new"}, {"LOWER": "york"}]

# Add with ID, optional callback and pattern(s)
>>> matcher.add("CITIES", None, pattern)

# Match by calling the matcher on a Doc object
>>> doc = nlp("I live in New York")
>>> matches = matcher(doc)

# Matches are (match_id, start, end) tuples
>>> for match_id, start, end in matches:
...     # Get the matched span by slicing the Doc
...     span = doc[start:end]
...     print(span.text)
'New York'
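The match_id in each tuple is an integer hash of the pattern name; the string can be looked up in the shared vocab's string store. A small sketch, assuming the matches from above:

>>> for match_id, start, end in matches:
...     print(nlp.vocab.strings[match_id], doc[start:end].text)
CITIES New York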
Token patterns

# "love cats", "loving cats", "loved cats"
>>> pattern1 = [{"LEMMA": "love"}, {"LOWER": "cats"}]

# "10 people", "twenty people"
>>> pattern2 = [{"LIKE_NUM": True}, {"TEXT": "people"}]

# "book", "a cat", "the sea" (noun + optional article)
>>> pattern3 = [{"POS": "DET", "OP": "?"}, {"POS": "NOUN"}]

Operators and quantifiers
Can be added to a token dict as the "OP" key.

!   Negate pattern and match exactly 0 times
?   Make pattern optional and match 0 or 1 times
+   Require pattern to match 1 or more times
*   Allow pattern to match 0 or more times
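As a sketch of the "OP" key in use (the pattern name "VERY_HAPPY" is arbitrary; the matcher call follows the same style as above):

# "very happy", "very very happy", ...
>>> pattern = [{"LOWER": "very", "OP": "+"}, {"LOWER": "happy"}]
>>> matcher.add("VERY_HAPPY", None, pattern)
>>> matches = matcher(nlp("I am very very happy"))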
> Glossary

Tokenization: Segmenting text into words, punctuation etc.
Lemmatization: Assigning the base forms of words, for example: "was" → "be" or "rats" → "rat".
Sentence Boundary Detection: Finding and segmenting individual sentences.
Part-of-speech (POS) Tagging: Assigning word types to tokens like verb or noun.
Dependency Parsing: Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
Named Entity Recognition (NER): Labeling named "real-world" objects, like persons, companies or locations.
Text Classification: Assigning categories or labels to a whole document, or parts of a document.
Statistical model: Process for making predictions based on examples.
Training: Updating a statistical model with new examples.