NLTK Chapter 2: corpora

Loading your own corpus

Using your corpus in NLTK (for NLTK operations such as concordances, text generation, similar words, etc.)

Making an NLTK Text() object

To work with your corpus and use some of the NLTK functionality we looked at in Chapter 1 (https://www.nltk.org/book/ch01.html), we need to turn our corpus into a Text() object.

With this Text() object (saved in the variable my_text) we can use the functions of the NLTK library, such as word counts, generating text, word collocations, etc.
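
A minimal sketch of building the Text() object. The sample string stands in for the contents of your own corpus file (it is hypothetical), and wordpunct_tokenize is chosen here only because it needs no extra NLTK data downloads; another tokenizer may suit your corpus better.

```python
from nltk.text import Text
from nltk.tokenize import wordpunct_tokenize

# Hypothetical stand-in for the text read from your own corpus file.
raw = "The cat sat on the mat. The dog sat on the rug."

# Split the raw string into word and punctuation tokens (regex-based).
tokens = wordpunct_tokenize(raw)

# Wrap the token list in a Text() object.
my_text = Text(tokens)
print(len(my_text), "tokens")
```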

word concordances
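
A concordance shows every occurrence of a word together with its surrounding context. The sketch below assumes a tiny hypothetical corpus; concordance() prints the matches, while concordance_list() (available in recent NLTK releases) returns them as objects.

```python
from nltk.text import Text

# Hypothetical mini corpus, already tokenized.
tokens = "The cat sat on the mat . The dog sat on the rug .".split()
my_text = Text(tokens)

# Print each occurrence of "sat" with its left and right context.
my_text.concordance("sat")

# The same matches as a list, useful for further processing.
hits = my_text.concordance_list("sat")
print(len(hits), "matches")
```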

calculated word similarity
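
similar() looks for words that appear in the same contexts (same left and right neighbours) as the word you give it. A small sketch on a hypothetical corpus in which "cat" and "dog" share the context "the _ sat":

```python
from nltk.text import Text

# Hypothetical mini corpus, already tokenized.
tokens = "The cat sat on the mat . The dog sat on the rug .".split()
my_text = Text(tokens)

# Prints the words that share contexts with "cat".
my_text.similar("cat")
```

On a corpus this small very few words share a context, so expect short results; larger corpora give more meaningful output.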

common contexts of two words

(common contexts can only be found if your corpus contains enough words)
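
A sketch of common_contexts() on a hypothetical mini corpus. It takes a list of words and prints the contexts they share, written as left_right.

```python
from nltk.text import Text

# Hypothetical mini corpus, already tokenized.
tokens = "The cat sat on the mat . The dog sat on the rug .".split()
my_text = Text(tokens)

# Prints the contexts in which both words occur.
my_text.common_contexts(["cat", "dog"])
```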

common word combinations: collocations

A collocation is a sequence of words that occur together unusually often. Thus red wine is a collocation, whereas the wine is not. A characteristic of collocations is that they are resistant to substitution with words that have similar senses; for example, maroon wine sounds definitely odd.
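
my_text.collocations() prints collocations directly, but it relies on the NLTK stopword list having been downloaded. The sketch below swaps in BigramCollocationFinder, which needs no extra data, and ranks word pairs with a likelihood-ratio score. The token list and the frequency threshold of 2 are assumptions for illustration.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Hypothetical mini corpus in which "red wine" recurs.
tokens = ("red wine and white wine then red wine with cheese "
          "then more red wine").split()

finder = BigramCollocationFinder.from_words(tokens)

# Ignore word pairs that occur fewer than 2 times.
finder.apply_freq_filter(2)

# Rank the remaining pairs by likelihood ratio and keep the top 5.
top_pairs = finder.nbest(BigramAssocMeasures.likelihood_ratio, 5)
print(top_pairs)
```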

where do words appear in a text? dispersion plotting
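
A dispersion plot marks the token positions at which each word occurs. The sketch below computes those positions in plain Python on a hypothetical token list, so it runs without a plotting library; the commented-out call is the NLTK method that draws the actual plot and requires matplotlib.

```python
# Hypothetical mini corpus, already tokenized.
tokens = "The cat sat on the mat . The dog sat on the rug .".split()
targets = ["cat", "dog", "sat"]

# For each target word, collect the token offsets where it appears.
offsets = {
    word: [i for i, tok in enumerate(tokens) if tok.lower() == word]
    for word in targets
}
print(offsets)

# With matplotlib installed, NLTK can draw the plot itself:
# from nltk.text import Text
# Text(tokens).dispersion_plot(targets)
```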

generate texts based on your corpus
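
In recent NLTK versions my_text.generate() produces random text from an n-gram model of the corpus, but it may need the punkt sentence tokenizer data. As a sketch of the same idea without extra downloads, the code below trains a bigram model with nltk.lm directly. The two sentences, the model order and the seed are assumptions; the output can include the padding symbols <s> and </s>.

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Hypothetical mini corpus: a list of tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
]

order = 2  # bigram model
train_data, vocab = padded_everygram_pipeline(order, sentences)

model = MLE(order)
model.fit(train_data, vocab)

# Generate 10 tokens; a fixed seed makes the run repeatable.
generated = model.generate(10, random_seed=42)
print(" ".join(generated))
```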

tokens that make your corpus
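
The tokens are stored as a plain list on the Text() object. A short sketch, assuming a hypothetical mini corpus:

```python
from nltk.text import Text

# Hypothetical mini corpus, already tokenized.
tokens = "The cat sat on the mat . The dog sat on the rug .".split()
my_text = Text(tokens)

first_tokens = my_text.tokens[:3]      # the first three tokens
n_tokens = len(my_text)                # total number of tokens
n_types = len(set(my_text.tokens))     # number of distinct tokens
print(first_tokens, n_tokens, n_types)
```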

lexical diversity
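
Lexical diversity is commonly measured as the number of distinct tokens divided by the total number of tokens. A minimal sketch; the function name and the sample tokens are illustrative, and the function works equally on a Text() object.

```python
def lexical_diversity(tokens):
    """Return the ratio of distinct tokens to all tokens."""
    return len(set(tokens)) / len(tokens)

# Hypothetical mini corpus: 8 tokens, 5 of them distinct.
tokens = "the cat and the dog and the bird".split()
diversity = lexical_diversity(tokens)
print(diversity)
```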

word counts
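
count() returns how often one token occurs. Note that it is case-sensitive, so "The" and "the" are counted separately. A sketch on a hypothetical mini corpus:

```python
from nltk.text import Text

# Hypothetical mini corpus, already tokenized.
tokens = "The cat sat on the mat . The dog sat on the rug .".split()
my_text = Text(tokens)

lower_count = my_text.count("the")   # only lowercase "the"
upper_count = my_text.count("The")   # only capitalized "The"
print(lower_count, upper_count)
```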

word counts in another way: frequency distributions
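
A frequency distribution counts every token at once. The sketch below lowercases the tokens and drops punctuation before counting; both steps are choices made for this example, not requirements.

```python
from nltk import FreqDist

# Hypothetical mini corpus, already tokenized.
tokens = "The cat sat on the mat . The dog sat on the rug .".split()

# Count lowercase alphabetic tokens only.
fd = FreqDist(tok.lower() for tok in tokens if tok.isalpha())

print(fd.most_common(3))   # the three most frequent words
print(fd["sat"])           # count of a single word
```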

long words
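
Long words can be picked out with a list comprehension over the vocabulary. The threshold of 7 characters and the sample tokens are arbitrary choices for illustration.

```python
# Hypothetical mini corpus, already tokenized.
tokens = ("The international committee published a surprisingly "
          "short statement").split()

# Distinct words longer than 7 characters, in alphabetical order.
long_words = sorted(w for w in set(tokens) if len(w) > 7)
print(long_words)
```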

counting word lengths
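
Feeding word lengths instead of words into a frequency distribution shows how common each length is. A sketch on hypothetical tokens:

```python
from nltk import FreqDist

# Hypothetical mini corpus, already tokenized.
tokens = "The cat sat on the mat".split()

# Count how many tokens there are of each length.
length_fd = FreqDist(len(tok) for tok in tokens)

print(length_fd.most_common())   # (length, count) pairs
print(length_fd.max())           # the most frequent length
```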

filter: ends with
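
A sketch of filtering on a word ending with str.endswith(). The token list and the suffix "ing" are hypothetical; the same comprehension works on a Text() object.

```python
# Hypothetical mini corpus, already tokenized.
tokens = "She was walking and talking while the king kept reading".split()

# Distinct words that end in "ing".
ing_words = sorted(w for w in set(tokens) if w.endswith("ing"))
print(ing_words)
```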

filter: starts with
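
The mirror image uses str.startswith(). Again the tokens and the prefix "un" are only examples.

```python
# Hypothetical mini corpus, already tokenized.
tokens = "The unhappy reader undid the unusual binding under protest".split()

# Distinct words that start with "un".
un_words = sorted(w for w in set(tokens) if w.startswith("un"))
print(un_words)
```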

filter: if 'XXX' is in the word
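
Python's in operator tests whether a substring occurs anywhere inside a word. In this sketch "thor" plays the role of 'XXX'; the tokens are hypothetical.

```python
# Hypothetical mini corpus, already tokenized.
tokens = "The author authored an unauthorised book about oranges".split()

# Distinct words that contain the substring "thor".
thor_words = sorted(w for w in set(tokens) if "thor" in w)
print(thor_words)
```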

filter: if Title
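
str.istitle() is true for words with an initial capital followed by lowercase letters, which is a rough way to spot names. It skips all-caps tokens such as "BBC". Sample tokens are hypothetical.

```python
# Hypothetical mini corpus, already tokenized.
tokens = "Alice met Bob in Paris while the BBC aired a film".split()

# Distinct title-cased words.
title_words = sorted(w for w in set(tokens) if w.istitle())
print(title_words)
```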

filter: if digit
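
str.isdigit() keeps tokens made up of digits only, so mixed tokens such as "v2" are dropped. Sample tokens are hypothetical.

```python
# Hypothetical mini corpus, already tokenized.
tokens = "In 1984 the 2 authors sold 300 copies of v2".split()

# Tokens consisting only of digits, in order of appearance.
numbers = [w for w in tokens if w.isdigit()]
print(numbers)
```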

Looking at the NLTK example corpora

Try loading another one and see if you can find its plain text files.
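
One possible way to track down the files, sketched with the Gutenberg corpus as an example. It assumes the corpus may not be downloaded yet, in which case it prints a hint instead of failing. nltk.data.path lists the directories NLTK searches, and the root attribute of a loaded corpus points at the folder holding its plain text files.

```python
import nltk
from nltk.corpus import gutenberg

# Directories in which NLTK looks for downloaded corpora.
search_paths = list(nltk.data.path)
print(search_paths)

try:
    # The names of the first few files in the corpus.
    print(gutenberg.fileids()[:3])
    # The folder on disk that contains those files.
    print(gutenberg.root)
except LookupError:
    print('Corpus not found; run nltk.download("gutenberg") first.')
```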