from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
text1
<Text: Moby Dick by Herman Melville 1851>
import nltk
text4
<Text: Inaugural Address Corpus>
text2
<Text: Sense and Sensibility by Jane Austen 1811>
text1.concordance("monstrous")
Displaying 11 of 11 matches: ong the former , one was of a most monstrous size . ... This came towards us , ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r ll over with a heathenish array of monstrous clubs and spears . Some were thick d as you gazed , and wondered what monstrous cannibal and savage could ever hav that has survived the flood ; most monstrous and most mountainous ! That Himmal they might scout at Moby Dick as a monstrous fable , or still worse and more de th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l ing Scenes . In connexion with the monstrous pictures of whales , I am strongly ere to enter upon those still more monstrous stories of them which are to be fo ght have been rummaged out of this monstrous cabinet there is no telling . But of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
text1.concordance("vivid")
Displaying 9 of 9 matches: indeed , literally justified by his vivid aspect , when seen gliding at high n ammock by exhausting and intolerably vivid dreams of the night , which , resumi yet they have nothing like a fixed , vivid conception of those perils , and the mpanions of this figure were of that vivid , tiger - yellow complexion peculiar se allusions of his were at times so vivid and life - like , that they would ca ern , the Pequod at last shot by the vivid green Cockatoo Point on the Sumatra s of effulgences . That unblinkingly vivid Japanese sun seems the blazing focus denly admonished to vigilance by the vivid lightning that had just been darting others , shading their eyes from the vivid sunlight , sat far out on the rockin
text3.concordance("lived")
Displaying 25 of 38 matches: ay when they were created . And Adam lived an hundred and thirty years , and be ughters : And all the days that Adam lived were nine hundred and thirty yea and nd thirty yea and he died . And Seth lived an hundred and five years , and bega ve years , and begat Enos : And Seth lived after he begat Enos eight hundred an welve years : and he died . And Enos lived ninety years , and begat Cainan : An years , and begat Cainan : And Enos lived after he begat Cainan eight hundred ive years : and he died . And Cainan lived seventy years and begat Mahalaleel : rs and begat Mahalaleel : And Cainan lived after he begat Mahalaleel eight hund years : and he died . And Mahalaleel lived sixty and five years , and begat Jar s , and begat Jared : And Mahalaleel lived after he begat Jared eight hundred a and five yea and he died . And Jared lived an hundred sixty and two years , and o years , and he begat Eno And Jared lived after he begat Enoch eight hundred y and two yea and he died . And Enoch lived sixty and five years , and begat Met ; for God took him . And Methuselah lived an hundred eighty and seven years , , and begat Lamech . And Methuselah lived after he begat Lamech seven hundred nd nine yea and he died . And Lamech lived an hundred eighty and two years , an ch the LORD hath cursed . And Lamech lived after he begat Noah five hundred nin naan shall be his servant . And Noah lived after the flood three hundred and fi xad two years after the flo And Shem lived after he begat Arphaxad five hundred at sons and daughters . And Arphaxad lived five and thirty years , and begat Sa ars , and begat Salah : And Arphaxad lived after he begat Salah four hundred an begat sons and daughters . And Salah lived thirty years , and begat Eber : And y years , and begat Eber : And Salah lived after he begat Eber four hundred and begat sons and daughters . 
And Eber lived four and thirty years , and begat Pe y years , and begat Peleg : And Eber lived after he begat Peleg four hundred an
text2.concordance("affection")
Displaying 25 of 79 matches: , however , and , as a mark of his affection for the three girls , he left them t . It was very well known that no affection was ever supposed to exist between deration of politeness or maternal affection on the side of the former , the tw d the suspicion -- the hope of his affection for me may warrant , without impru hich forbade the indulgence of his affection . She knew that his mother neither rd she gave one with still greater affection . Though her late conversation wit can never hope to feel or inspire affection again , and if her home be uncomfo m of the sense , elegance , mutual affection , and domestic comfort of the fami , and which recommended him to her affection beyond every thing else . His soci ween the parties might forward the affection of Mr . Willoughby , an equally st the most pointed assurance of her affection . Elinor could not be surprised at he natural consequence of a strong affection in a young and ardent mind . This opinion . But by an appeal to her affection for her mother , by representing t every alteration of a place which affection had established as perfect with hi e will always have one claim of my affection , which no other can possibly shar f the evening declared at once his affection and happiness . " Shall we see you ause he took leave of us with less affection than his usual behaviour has shewn ness ." " I want no proof of their affection ," said Elinor ; " but of their en onths , without telling her of his affection ;-- that they should part without ould be the natural result of your affection for her . She used to be all unres distinguished Elinor by no mark of affection . Marianne saw and listened with i th no inclination for expense , no affection for strangers , no profession , an till distinguished her by the same affection which once she had felt no doubt o al of her confidence in Edward ' s affection , to the remembrance of every mark was made ? 
Had he never owned his affection to yourself ?" " Oh , no ; but if
text4.concordance("god")
Displaying 25 of 108 matches: eliance on the protection of Almighty God , I shall forthwith commence the duti humble , acknowledged dependence upon God and His overruling providence . We ha great office I must humbly invoke the God of our fathers for wisdom and firmnes d the same Bible and pray to the same God , and each invokes His aid against th hat any men should dare to ask a just God ' s assistance in wringing their brea offenses which , in the providence of God , must needs come , but which , havin butes which the believers in a living God always ascribe to Him ? Fondly do we war may speedily pass away . Yet , if God wills that it continue until all the r all , with firmness in the right as God gives us to see the right , let us st the prayers of the nation to Almighty God in behalf of this consummation . Fell r , they have " followed the light as God gave them to see the light ." They ar ess their fathers and their fathers ' God that the Union was preserved , that s the support and blessings of Almighty God . Fellow citizens , in the presence o ng the power and goodness of Almighty God , who presides over the destiny of na expect the favor and help of Almighty God -- that He will give to me wisdom , s suggestion to enterprise and labor . God has placed upon our head a diadem and urn than the pledge I now give before God and these witnesses of unreserved and han human life can escape the laws of God and nature . Manifestly nothing is mo and invoking the guidance of Almighty God . Our faith teaches that there is no re is no safer reliance than upon the God of our fathers , who has so singularl e the direction and favor of Almighty God . I should shrink from the duties thi devolve upon it , and in the fear of God will " take occasion by the hand and citizens and the aid of the Almighty God in the discharge of my responsible du our heartstrings like some air out of God ' s own presence , where justice and forward - looking men , to my side . 
God helping me , I will not fail them , i
text4.concordance("Almighty God") #concordance matches single tokens, so a two-word phrase finds nothing
no matches
text5.concordance("lol")
Displaying 25 of 822 matches: ast PART 24 / m boo . 26 / m and sexy lol U115 boo . JOIN PART he drew a girl w ope he didnt draw a penis PART ewwwww lol & a head between her legs JOIN JOIN s a bowl i got a blunt an a bong ...... lol JOIN well , glad it worked out my cha e " PART Hi U121 in ny . ACTION would lol @ U121 . . . but appearently she does 30 make sure u buy a nice ring for U6 lol U7 Hi U115 . ACTION isnt falling for didnt ya hear !!!! PART JOIN geeshhh lol U6 PART hes deaf ppl here dont get it es nobody here i wanna misbeahve with lol JOIN so read it . thanks U7 .. Im hap ies want to chat can i talk to him !! lol U121 !!! forwards too lol JOIN ALL PE k to him !! lol U121 !!! forwards too lol JOIN ALL PErvs ... redirect to U121 ' loves ME the most i love myself JOIN lol U44 how do u know that what ? jerkett ng wrong ... i can see it in his eyes lol U20 = fiance Jerketts lmao wtf yah I cooler by the minute what 'd I miss ? lol noo there too much work ! why not ?? that mean I want you ? U6 hello room lol U83 and this .. has been the grammar the rule he 's in PM land now though lol ah ok i wont bug em then someone wann flight to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 80 ht to hell :) lmao bbl maybe PART LOL lol U7 it was me , U83 hahah U83 ! 808265 082653953 K-Fed got his ass kicked .. Lol . ACTION laughs . i got a first class . i got a first class ticket to hell lol U7 JOIN any texas girls in here ? any . whats up U155 i was only kidding . lol he 's a douchebag . Poor U121 i 'm bo ??? sits with U30 Cum to my shower . lol U121 . ACTION U1370 watches his nads ur nad with a stick . ca u U23 ewwww lol *sniffs* ewwwwww PART U115 ! owww spl ACTION is resisting . ur female right lol U115 beeeeehave Remember the LAst tim pm's me . charge that is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLO is 1.99 / min . lol @ innocent hahah lol .... yeah LOLOLOLLL U12 thats not nic s . lmao no U115 Check my record . 
:) Lol lick em U7 U23 how old r u lol Way to
text1.similar("monstrous")
true contemptible christian abundant few part mean careful puzzled mystifying passing curious loving wise doleful gamesome singular delightfully perilous fearless
text2.similar("monstrous")
very so exceedingly heartily a as good great extremely remarkably sweet vast amazingly
text1.similar("monstrous")
true contemptible christian abundant few part mean careful puzzled mystifying passing curious loving wise doleful gamesome singular delightfully perilous fearless
#to examine just the contexts that are shared by two or more words, such as monstrous and very
text2.common_contexts(["monstrous", "very"])
am_glad a_pretty a_lucky is_pretty be_glad
text2.similar("beautiful")
pretty by long large respectable good great s young important considerable is little keen tender quiet considered stout serious richer
text9.common_contexts(["beautiful", "important"])
No common contexts were found
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
text3.generate()
Building ngram index...
laid by her , and said unto Cain , Where art thou , and said , Go to , I will not do it for ten ' s sons ; we dreamed each man according to their generatio the firstborn said unto Laban , Because I said , Nay , but Sarah shall her name be . , duke Elah , duke Shobal , and Akan . and looked upon my affliction . Bashemath Ishmael ' s blood , but Isra for as a prince hast thou found of all the cattle in the valley , and the wo The
"laid by her , and said unto Cain , Where art thou , and said , Go to ,\nI will not do it for ten ' s sons ; we dreamed each man according to\ntheir generatio the firstborn said unto Laban , Because I said , Nay ,\nbut Sarah shall her name be . , duke Elah , duke Shobal , and Akan .\nand looked upon my affliction . Bashemath Ishmael ' s blood , but Isra\nfor as a prince hast thou found of all the cattle in the valley , and\nthe wo The"
text2.generate()
Building ngram index...
knew , had by this remembrance , and if , by rapid degrees , so long . , she could live without one another , and in her rambles . at least the last evening of a brother , could you know , from the first . Dashwood ? this gentleman himself , and must put up with a kindness which they are very much vexed at , for it -- and I shall keep it entirely . admire it ; and it is , explain the grounds , or if any place could give her ease , was a
'knew , had by this remembrance , and if , by rapid degrees , so long .\n, she could live without one another , and in her rambles . at least\nthe last evening of a brother , could you know , from the first .\nDashwood ? this gentleman himself , and must put up with a kindness\nwhich they are very much vexed at , for it -- and I shall keep it\nentirely . admire it ; and it is , explain the grounds , or if any\nplace could give her ease , was a'
print(text1.generate()) * 2 #error: print() returns None, and None cannot be multiplied
Building ngram index...
long , from one to the top - mast , and no coffin and went out a sea captain -- this peaking of the whales . , so as to preserve all his might had in former years abounding with them , they toil with their lances , strange tales of Southern whaling . at once the bravest Indians he was , after in vain strove to pierce the profundity . ? then ?" a levelled flame of pale , And give no chance , watch him ; though the line , it is to be gainsaid . have been long , from one to the top - mast , and no coffin and went out a sea captain -- this peaking of the whales . , so as to preserve all his might had in former years abounding with them , they toil with their lances , strange tales of Southern whaling . at once the bravest Indians he was , after in vain strove to pierce the profundity . ? then ?" a levelled flame of pale , And give no chance , watch him ; though the line , it is to be gainsaid . have been
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) /tmp/ipykernel_2206/887456320.py in <module> ----> 1 print(text1.generate()) * 2 TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
print((text1.generate()) * 2) #fixed: multiply the returned string first, then print it
len(text1)
260819
len(text2)
141576
len(text3)
44764
#set(text3)
#sorted(set(text3))
len(set(text3)) #counting the distinct word types (words and punctuation)
2789
len(set(text3)) / len(text3) #lexical richness: distinct types relative to total tokens
0.06230453042623537
# >> about 6%; equivalently, each word type is used about 16 times on average
text3.count("smote") #how many times a specific word is used in a text
5
100 * text3.count("smote") / len(text3) #what % of the text is taken by a specific word
0.01116968992940756
text5.count("lol")
704
100 * text5.count("lol") / len(text5)
1.5640968673628082
#defining functions:
def lexical_diversity(text): #takes a single parameter, text
    return len(set(text)) / len(text)
def percentage(count, total): #takes two parameters
    return 100 * count / total
#these functions package the two computations above for reuse :)
lexical_diversity(text1)
0.07406285585022564
percentage(2, 4) #the data inside - that is the argument e.g. 2, 4
50.0
percentage(text4.count('a'), len(text4)) #(how many times a specific word is used in text), (len of the whole text)
1.457973123627309
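The two helpers can also be combined into one call. The sketch below uses a hypothetical `word_percentage` helper (not from the book) on a toy token list, so it runs without loading the NLTK texts:

```python
def percentage(count, total):
    return 100 * count / total

def word_percentage(word, tokens):
    # tokens is any list of word strings, e.g. list(text4)
    return percentage(tokens.count(word), len(tokens))

toy = ['a', 'rose', 'is', 'a', 'rose']
print(word_percentage('rose', toy))  # 40.0
```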
2 A Closer Look at Python: Texts as Lists of Words
# 2.1 Lists
sent1 = ['Call', 'me', 'Ishmael', '.'] #how we represent text in Python; this is a list in Python
sent1
['Call', 'me', 'Ishmael', '.']
len(sent1)
4
lexical_diversity(sent1)
1.0
print
<function print>
sent2
['The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.']
sent3
['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']
ex1 = ['Welcome', 'to', 'the', 'world', 'of', 'bees']
ex1
['Welcome', 'to', 'the', 'world', 'of', 'bees']
print(ex1)
['Welcome', 'to', 'the', 'world', 'of', 'bees']
len(ex1)
6
lexical_diversity(ex1)
1.0
ex2 = ['they', 'are', 'awesome']
ex1 + ex2 # concatenation
['Welcome', 'to', 'the', 'world', 'of', 'bees', 'they', 'are', 'awesome']
sent1.append("Some") #appending - adding a single element to a list
sent1
['Call', 'me', 'Ishmael', '.', 'Some']
# 2.2. Indexing Lists
text4[173] #the position of a word in the text = index
'awaken'
text4.index('awaken') #finding which is the index of a word
173
text5[16715:16735] #slicing - to access sublists as well, extracting manageable pieces of language from large texts
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good', 'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without', 'buying', 'it']
text6[1600:1625]
['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We', 'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive', 'officer', 'for', 'the', 'week']
sent = ['word1', 'word2', 'word3', 'word4', 'word5',
        'word6', 'word7', 'word8', 'word9', 'word10']
sent[0]
'word1'
sent[8]
'word9'
sent[10] #runtime error
--------------------------------------------------------------------------- IndexError Traceback (most recent call last) /tmp/ipykernel_17310/3888686037.py in <module> ----> 1 sent[10] #runtime error IndexError: list index out of range
This produces a Traceback message showing the context of the error, followed by the name of the error, IndexError, and a brief explanation.
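When an out-of-range index should not abort a session, the IndexError can be caught instead. A minimal sketch with a hypothetical `safe_get` helper (not from the book):

```python
sent = ['word1', 'word2', 'word3']

def safe_get(seq, i, default=None):
    # Return seq[i], or default instead of raising IndexError
    try:
        return seq[i]
    except IndexError:
        return default

print(safe_get(sent, 0))   # word1
print(safe_get(sent, 10))  # None
```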
sent[5:8] #slicing
['word6', 'word7', 'word8']
sent[:3]
['word1', 'word2', 'word3']
text2[141525:] #seeing a sequence of the last words as a list
['among', 'the', 'merits', 'and', 'the', 'happiness', 'of', 'Elinor', 'and', 'Marianne', ',', 'let', 'it', 'not', 'be', 'ranked', 'as', 'the', 'least', 'considerable', ',', 'that', 'though', 'sisters', ',', 'and', 'living', 'almost', 'within', 'sight', 'of', 'each', 'other', ',', 'they', 'could', 'live', 'without', 'disagreement', 'between', 'themselves', ',', 'or', 'producing', 'coolness', 'between', 'their', 'husbands', '.', 'THE', 'END']
#2.3 Variables
Such lines have the form: variable = expression. Python will evaluate the expression, and save its result to the variable. This process is called assignment.
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
           'forth', 'from', 'Camelot', '.']
noun_phrase = my_sent[1:4]
noun_phrase
['bold', 'Sir', 'Robin']
wOrDs = sorted(noun_phrase)
wOrDs
['Robin', 'Sir', 'bold']
sorted
<function sorted(iterable, /, *, key=None, reverse=False)>
vocab = set(text1)
vocab_size = len(vocab)
vocab_size
19317
Variables can hold intermediate steps of a computation, especially when this makes the code easier to follow.
#2.4 Strings
name = 'Monty' #we can assign a string to a variable
name[0] #we can index a string; a single string can be accessed letter by letter, too
'M'
name[:4] #we can slice a string; letters can be counted and indexed
'Mont'
len(name)
5
name * 2
'MontyMonty'
name + '!'
'Monty!'
' '.join(['Monty', 'Python']) #join the words of a list to make a single string
'Monty Python'
'Monty Python'.split() #split a string into a list
['Monty', 'Python']
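A quick sketch showing that split() and ' '.join() are near-inverses for space-separated text:

```python
# Round trip: string -> list of words -> string
words = 'Monty Python'.split()
assert words == ['Monty', 'Python']
assert ' '.join(words) == 'Monty Python'
# split() with no argument also collapses runs of whitespace:
assert 'Monty   Python\n'.split() == ['Monty', 'Python']
```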
3 Computing with Language: Simple Statistics
3.1 Frequency Distributions
saying = ['After', 'all', 'is', 'said', 'and', 'done',
          'more', 'is', 'said', 'than', 'done']
tokens = set(saying)
tokens = sorted(tokens)
tokens[-2:]
['said', 'than']
sorted(tokens)
['After', 'all', 'and', 'done', 'is', 'more', 'said', 'than']
set(saying)
{'After', 'all', 'and', 'done', 'is', 'more', 'said', 'than'}
tokens[:2]
['After', 'all']
tokens[0]
'After'
# finding the most frequently used words in a text
# 1:
fdist1 = FreqDist(text1) #Moby Dick, name of the text - the argument
print(fdist1) #inspect the total number of words ("outcomes") that have been counted up
<FreqDist with 19317 samples and 260819 outcomes>
# 2:
fdist1.most_common(50) #gives us a list of the 50 most frequently occurring types in the text
[(',', 18713), ('the', 13721), ('.', 6862), ('of', 6536), ('and', 6024), ('a', 4569), ('to', 4542), (';', 4072), ('in', 3916), ('that', 2982), ("'", 2684), ('-', 2552), ('his', 2459), ('it', 2209), ('I', 2124), ('s', 1739), ('is', 1695), ('he', 1661), ('with', 1659), ('was', 1632), ('as', 1620), ('"', 1478), ('all', 1462), ('for', 1414), ('this', 1280), ('!', 1269), ('at', 1231), ('by', 1137), ('but', 1113), ('not', 1103), ('--', 1070), ('him', 1058), ('from', 1052), ('be', 1030), ('on', 1005), ('so', 918), ('whale', 906), ('one', 889), ('you', 841), ('had', 767), ('have', 760), ('there', 715), ('But', 705), ('or', 697), ('were', 680), ('now', 646), ('which', 640), ('?', 637), ('me', 627), ('like', 624)]
fdist1['whale'] #how many times does a specific word occur?
906
fdist2 = FreqDist(text2)
print(fdist1)
<FreqDist with 19317 samples and 260819 outcomes>
fdist2.most_common(50)
[(',', 9397), ('to', 4063), ('.', 3975), ('the', 3861), ('of', 3565), ('and', 3350), ('her', 2436), ('a', 2043), ('I', 2004), ('in', 1904), ('was', 1846), ('it', 1568), ('"', 1506), (';', 1419), ('she', 1333), ('be', 1305), ('that', 1297), ('for', 1234), ('not', 1212), ('as', 1179), ('you', 1037), ('with', 971), ('had', 969), ('his', 941), ('he', 895), ("'", 883), ('have', 807), ('at', 806), ('by', 737), ('is', 728), ('."', 721), ('s', 700), ('Elinor', 684), ('on', 676), ('all', 642), ('him', 633), ('so', 617), ('but', 597), ('which', 592), ('could', 568), ('Marianne', 566), ('my', 551), ('Mrs', 530), ('from', 527), ('would', 507), ('very', 492), ('no', 488), ('their', 463), ('them', 462), ('--', 461)]
fdist2['Elinor'] #:D
684
#note the high ranks of 'her' and 'she' - pronoun frequency is often discussed in studies of women's writing
fdist1.plot(50, cumulative=True)
<AxesSubplot:xlabel='Samples', ylabel='Cumulative Counts'>
#fdist1.hapaxes()
3.2 Fine-grained Selection of Words
# finding the long words in a text
V = set(text1)
long_words = [w for w in V if len(w) > 15]
sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
#set(text1)
V = set(text1)
long_words = [w for w in V if len(w) > 17] # A: the variable w is bound to each element of V in turn
sorted(long_words)
['characteristically', 'uninterpenetratingly']
V = set(text5)
long_words = [w for w in V if len(w) > 25]
sorted(long_words)
['!!!!!!!!!!!!!!!!!!!!!!!!!!!', '!!!!!!!!!!!!!!!!!!!!!!!!!!!!', '!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!', '((((((((((((((((((((((((((', '))))))))))))))))))))))))))))', ')))))))))))))))))))))))))))))))', '..............................', '//www.wunderground.com/cgi-bin/findweather/getForecast?query=95953#FIR', 'BAAAAALLLLLLLLIIIIIIINNNNNNNNNNN', 'HHEEYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY', 'Mooooooooooooooooooooooooooo', 'backfrontsidewaysandallaroundtheworld', 'cooooooooookiiiiiiiiiiiieeeeeeeeeeee', 'hahahahahahahahahahahahahahahaha', 'http://forums.talkcity.com/tc-adults/start ', 'huuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuugZ', 'llloooozzzzeeerrrrzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz', 'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'mikeeeeeeeeeeeeeeeeeeeeeeeeee', 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee', 'oooooooooooooonnnnnnnnnnnneeeeeeeeeeeeeeesssssssss', 'raaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'weeeeeeeeeeeeeeeeeeeeeeeeed', 'wheeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee', 'woooooooooaaaahhhhhhhhhhhh', 'wooooooooooooohoooooooooooooooo']
fdist2 = FreqDist(text2)
sorted(w for w in set(text2) if len(w) > 10 and fdist2[w] > 7)
# all words from Sense and Sensibility that are longer than 10 characters
# and occur more than seven times
['Somersetshire', 'acknowledge', 'acknowledged', 'acquaintance', 'affectionate', 'approbation', 'astonishment', 'cheerfulness', 'circumstance', 'circumstances', 'comfortable', 'commendation', 'communication', 'consciousness', 'consequence', 'considerable', 'consideration', 'consolation', 'continuance', 'conversation', 'countenance', 'difficulties', 'disappointed', 'disappointment', 'disposition', 'earnestness', 'embarrassment', 'encouragement', 'encouraging', 'endeavouring', 'engagements', 'established', 'exceedingly', 'expectation', 'explanation', 'extraordinary', 'imagination', 'immediately', 'improvement', 'inclination', 'inconvenience', 'indifference', 'indifferent', 'indignation', 'information', 'intelligence', 'interesting', 'neighbourhood', 'observation', 'opportunity', 'particularly', 'particulars', 'performance', 'probability', 'recollection', 'recommended', 'remembrance', 'resemblance', 'respectable', 'satisfaction', 'sensibility', 'twelvemonth', 'uncomfortable', 'understanding', 'unfortunate', 'unfortunately', 'unnecessary']
# 3.3 Collocations and Bigrams
#bigrams()
list(bigrams(['more', 'is', 'said', 'than', 'done'])) #extracting from a text a list of word pairs
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
bigrams(['more', 'is', 'said', 'than', 'done'])
<generator object bigrams at 0x9b565730>
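Since bigrams() returns a generator, list() is needed to materialize the pairs. A minimal pure-Python equivalent (a sketch, not the library implementation) makes the pairing explicit:

```python
# Pair each word with its successor, lazily, like nltk.bigrams
def my_bigrams(words):
    for i in range(len(words) - 1):
        yield (words[i], words[i + 1])

pairs = list(my_bigrams(['more', 'is', 'said', 'than', 'done']))
print(pairs)  # [('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
```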
text4.collocations()
#to find bigrams that occur more often than we would expect
#based on the frequency of the individual words
United States; fellow citizens; four years; years ago; Federal Government; General Government; American people; Vice President; God bless; Chief Justice; Old World; Almighty God; Fellow citizens; Chief Magistrate; every citizen; one another; fellow Americans; Indian tribes; public debt; foreign nations
text8.collocations()
would like; medium build; social drinker; quiet nights; non smoker; long term; age open; Would like; easy going; financially secure; fun times; similar interests; Age open; weekends away; poss rship; well presented; never married; single mum; permanent relationship; slim build
text5.collocations()
wanna chat; PART JOIN; MODE #14-19teens; JOIN PART; PART PART; cute.-ass MP3; MP3 player; JOIN JOIN; times .. .; ACTION watches; guys wanna; song lasts; last night; ACTION sits; -...)...- S.M.R.; Lime Player; Player 12%; dont know; lez gurls; long time
# 3.4 Counting Other Things
The distribution of word lengths in a text: build a FreqDist from a list of numbers, where each number is the length of the corresponding word in the text.
[len(w) for w in text1] # list of the lengths of words in text1
fdist = FreqDist(len(w) for w in text1) # the number of times each of these occurs
print(fdist) # result
<FreqDist with 19 samples and 260819 outcomes>
fdist
FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399, 8: 9966, 9: 6428, 10: 3528, ...})
fdist.most_common()
[(3, 50223), (1, 47933), (4, 42345), (2, 38513), (5, 26597), (6, 17111), (7, 14399), (8, 9966), (9, 6428), (10, 3528), (11, 1873), (12, 1053), (13, 567), (14, 177), (15, 70), (16, 22), (17, 12), (18, 1), (20, 1)]
[len(w) for w in text2]
fdlist2 = FreqDist(len(w) for w in text2) #note: the new name is fdlist2
print(fdist2) #typo - this prints the word-frequency distribution from earlier, not the new length distribution
<FreqDist with 6833 samples and 141576 outcomes>
fdist2
FreqDist({',': 9397, 'to': 4063, '.': 3975, 'the': 3861, 'of': 3565, 'and': 3350, 'her': 2436, 'a': 2043, 'I': 2004, 'in': 1904, ...})
#fdist2.most_common()
fdist.max()
3
fdist[3]
50223
fdist.freq(3)
0.19255882431878046
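The relationships among max(), indexing, and freq() can be checked with the standard library's collections.Counter, which behaves much like a FreqDist (a sketch on toy data, not the NLTK class itself):

```python
from collections import Counter

# Toy word lengths; Counter mirrors FreqDist's counting behaviour
lengths = [3, 1, 4, 3, 3, 1]
fdist = Counter(lengths)
total = sum(fdist.values())

assert fdist.most_common(1)[0][0] == 3  # like fdist.max(): most frequent sample
assert fdist[3] == 3                    # like fdist[3]: its raw count
assert fdist[3] / total == 0.5          # like fdist.freq(3): relative frequency
```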
sent7
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
[w for w in sent7 if len(w) < 4]
[',', '61', 'old', ',', 'the', 'as', 'a', '29', '.']
[w for w in sent7 if len(w) >= 5]
['Pierre', 'Vinken', 'years', 'board', 'nonexecutive', 'director']
[w for w in sent7 if len(w) <= 4]
[',', '61', 'old', ',', 'will', 'join', 'the', 'as', 'a', 'Nov.', '29', '.']
[w for w in sent7 if len(w) == 4]
['will', 'join', 'Nov.']
[w for w in sent7 if len(w) != 4]
['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', '29', '.']
# all words that end with 'ableness':
sorted(w for w in set(text1) if w.endswith('ableness'))
['comfortableness', 'honourableness', 'immutableness', 'indispensableness', 'indomitableness', 'intolerableness', 'palpableness', 'reasonableness', 'uncomfortableness']
sorted(w for w in set(text2) if w.endswith('nity')) #same ending - nice for rhyme experiments
['Opportunity', 'Vanity', 'dignity', 'humanity', 'importunity', 'indignity', 'opportunity', 'solemnity', 'vanity', 'vicinity']
sorted(term for term in set(text4) if 'gnt' in term)
['Sovereignty', 'sovereignties', 'sovereignty']
sorted(term for term in set(text2) if 'sur' in term)
['absurd', 'absurdity', 'assurance', 'assurances', 'assure', 'assured', 'assuring', 'censure', 'censured', 'composure', 'disclosure', 'displeasure', 'enclosure', 'ensured', 'insured', 'insurmountable', 'leisure', 'leisurely', 'measure', 'measured', 'measures', 'measuring', 'pleasure', 'pleasures', 'sure', 'surely', 'surfaces', 'surpass', 'surpassed', 'surpassing', 'surplice', 'surprise', 'surprised', 'surprising', 'surrounded', 'surrounding', 'survey', 'surveying', 'survived', 'surviving', 'treasured']
#sorted(item for item in set(text6) if item.istitle())
sorted(item for item in set(sent7) if item.isdigit())
['29', '61']
sorted(w for w in set(text7) if '-' in w and 'index' in w)
['Stock-index', 'index-arbitrage', 'index-fund', 'index-options', 'index-related', 'stock-index']
# words containing both '-' and 'index'
sorted(wd for wd in set(text2) if wd.istitle() and len(wd) > 13)
['Disappointment']
# titlecase words longer than 13 letters
sorted(w for w in set(sent7) if not w.islower())
[',', '.', '29', '61', 'Nov.', 'Pierre', 'Vinken']
# words that are not entirely lowercase
sorted(t for t in set(text2) if 'cie' in t or 'cei' in t)
['ancient', 'ceiling', 'conceit', 'conceited', 'conceive', 'conscience', 'conscientious', 'conscientiously', 'deceitful', 'deceive', 'deceived', 'deceiving', 'deficiencies', 'deficiency', 'deficient', 'delicacies', 'excellencies', 'fancied', 'insufficiency', 'insufficient', 'legacies', 'perceive', 'perceived', 'perceiving', 'prescience', 'prophecies', 'receipt', 'receive', 'received', 'receiving', 'society', 'species', 'sufficient', 'sufficiently', 'undeceive', 'undeceiving']
# words containing 'cie' or 'cei'
#[len(w) for w in text1]
#[w.upper() for w in text1]
#python will perform the same operation for each element in a list
len(text1)
260819
len(set(text1))
19317
#(set(text1))
len(set(word.lower() for word in text1))
#removing double-counting words like This and this, which differ only in capitalization
17231
len(set(word.lower() for word in text1 if word.isalpha()))
#eliminate numbers and punctuation from the vocabulary count
#by filtering out any non-alphabetic items
16948
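The same normalization steps on a small, hand-made token list (a self-contained sketch, not from the book):

```python
# Lowercasing merges case variants; isalpha() drops numbers and punctuation
tokens = ['This', 'is', ',', 'this', 'is', '1851', '.']
vocab = set(w.lower() for w in tokens)
assert len(vocab) == 5  # {'this', 'is', ',', '1851', '.'}
alpha_vocab = set(w.lower() for w in tokens if w.isalpha())
assert alpha_vocab == {'this', 'is'}
```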
# if - a conditional expression, if statement; a control structure
# executing a block of code when a conditional expression is satisfied - e.g. "if" statement
word = 'cat'
if len(word) < 5: #conditional expression
    print('word length is less than 5')
word length is less than 5
word = 'hello'
word = 'hell'
if len(word) < 5:
    print('word length is less than 5')
word length is less than 5
word = 'cat'
if len(word) >= 5:
    print('word length is greater than or equal to 5')
# the "for" loop; also a control structure
for word in ['Call', 'me', 'Ishmael', '.']:
    print(word)
Call
me
Ishmael
.
for l in ['a', 'b', 'c']:
    print(l)
a
b
c
# combination between the loops and the conditions:
# we will loop through all the elements and will only print the ones that meet the criteria:
sent1 = ['Call', 'me', 'Ishmael', '.']
for zzy in sent1:
    if zzy.endswith('l'):
        print(zzy)
Call
Ishmael
for token in sent1:
    if token.islower():
        print(token, 'is a lowercase word')
    elif token.istitle():
        print(token, 'is a titlecase word')
    else:
        print(token, 'is punctuation')
Call is a titlecase word
me is a lowercase word
Ishmael is a titlecase word
. is punctuation
sorted(w for w in set(text2) if 'cie' in w or 'cei' in w)
['ancient', 'ceiling', 'conceit', 'conceited', 'conceive', 'conscience', 'conscientious', 'conscientiously', 'deceitful', 'deceive', 'deceived', 'deceiving', 'deficiencies', 'deficiency', 'deficient', 'delicacies', 'excellencies', 'fancied', 'insufficiency', 'insufficient', 'legacies', 'perceive', 'perceived', 'perceiving', 'prescience', 'prophecies', 'receipt', 'receive', 'received', 'receiving', 'society', 'species', 'sufficient', 'sufficiently', 'undeceive', 'undeceiving']
tricky = sorted(w for w in set(text2) if 'cie' in w or 'cei' in w) #sorted words containing 'cie' or 'cei'
for word in tricky: #loop through the matching words
    print(word, end=' ') #print them on one line, separated by spaces
ancient ceiling conceit conceited conceive conscience conscientious conscientiously deceitful deceive deceived deceiving deficiencies deficiency deficient delicacies excellencies fancied insufficiency insufficient legacies perceive perceived perceiving prescience prophecies receipt receive received receiving society species sufficient sufficiently undeceive undeceiving
nltk.chat.chatbots()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
/tmp/ipykernel_17310/3022652667.py in <module>
----> 1 nltk.chat.chatbots()
NameError: name 'nltk' is not defined
import nltk
#nltk.chat.chatbots()
nltk.chat.eliza.demo()
#sorted(set(t)) # NameError: t is only bound inside the earlier generator expression
#sorted(set(text2))
#sorted(set(text1))
#[len(x) for x in text2]
13 * (12 + 64)
988
The Python multiplication operation can be applied to lists. What happens when you type ['Monty', 'Python'] * 20, or 3 * sent1?
sent2 = ['Alex', 'Oleg']
3 * sent2
['Alex', 'Oleg', 'Alex', 'Oleg', 'Alex', 'Oleg']
Review 1 on computing with language. How many words are there in text2? How many distinct words are there?
len(text2) #len of the whole text
141576
len(set(text2)) #how many distinct words there are (the size of the vocabulary)
6833
Compare the lexical diversity scores for humor and romance fiction in 1.1. Which genre is more lexically diverse?
def lexical_diversity(text):
    return len(set(text)) / len(text)
lexical_diversity(text2)
0.04826383002768831
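The exercise asks to compare humor and romance fiction from the Brown corpus; without those texts loaded here, a minimal sketch on invented toy token lists shows how the score behaves:

```python
def lexical_diversity(text):
    # ratio of distinct tokens to total tokens
    return len(set(text)) / len(text)

# toy lists, made up for illustration
repetitive = ['the', 'cat', 'and', 'the', 'cat', 'and', 'the', 'cat']
varied = ['the', 'cat', 'sat', 'on', 'a', 'soft', 'red', 'mat']
print(lexical_diversity(repetitive))  # 3 distinct / 8 total = 0.375
print(lexical_diversity(varied))      # 8 distinct / 8 total = 1.0
```

A higher score means each word is repeated less often, i.e. the text is more lexically diverse.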
text5.collocations()
wanna chat; PART JOIN; MODE #14-19teens; JOIN PART; PART PART; cute.-ass MP3; MP3 player; JOIN JOIN; times .. .; ACTION watches; guys wanna; song lasts; last night; ACTION sits; -...)...- S.M.R.; Lime Player; Player 12%; dont know; lez gurls; long time
Find the collocations in text5.
Consider the following Python expression: len(set(text4)). State the purpose of this expression. Describe the two steps involved in performing this computation.
#step 1:
#set(text4) # build the set of distinct tokens in the text
#step 2:
len(set(text4)) # count the distinct tokens (the vocabulary size)
9913
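The same two steps can be seen on a toy token list (invented for illustration):

```python
toy = ['to', 'be', 'or', 'not', 'to', 'be']
vocab = set(toy)   # step 1: collapse repeated tokens into a set of distinct ones
print(len(vocab))  # step 2: count the distinct tokens -> 4
```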
Review 2 on lists and strings.
Define a string and assign it to a variable, e.g., my_string = 'My String' (but put something more interesting in the string). Print the contents of this variable in two ways, first by simply typing the variable name and pressing enter, then by using the print statement. Try adding the string to itself using my_string + my_string, or multiplying it by a number, e.g., my_string * 3. Notice that the strings are joined together without any spaces. How could you fix this?
my_string = 'Beautiful world'
print(my_string)
Beautiful world
my_string
'Beautiful world'
(my_string) * 2
'Beautiful worldBeautiful world'
print(((my_string) + ' ') * 2)
Beautiful world Beautiful world
((my_string) + ' ') * 2
'Beautiful world Beautiful world '
Define a variable my_sent to be a list of words, using the syntax my_sent = ["My", "sent"] (but with your own words, or a favorite saying).
Use ' '.join(my_sent) to convert this into a string. Use split() to split the string back into the list form you had to start with.
my_sent = ["Welcome", "to", "the", "world", "of", "bees"]
' '.join(my_sent)
'Welcome to the world of bees'
my_sent
['Welcome', 'to', 'the', 'world', 'of', 'bees']
my_sent = ' '.join(my_sent) #converting it into a string
my_sent # the cell above was re-run: joining a string space-separates its characters
'W e l c o m e t o t h e w o r l d o f b e e s'
my_sent.split # missing parentheses: this returns the method object, not its result
<function str.split(sep=None, maxsplit=-1)>
my_sent
'W e l c o m e t o t h e w o r l d o f b e e s'
x = my_sent.split()
x
['W', 'e', 'l', 'c', 'o', 'm', 'e', 't', 'o', 't', 'h', 'e', 'w', 'o', 'r', 'l', 'd', 'o', 'f', 'b', 'e', 'e', 's']
my_sent = my_sent.split()
my_sent
['W', 'e', 'l', 'c', 'o', 'm', 'e', 't', 'o', 't', 'h', 'e', 'w', 'o', 'r', 'l', 'd', 'o', 'f', 'b', 'e', 'e', 's']
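Done in one pass with fresh variables (so the re-run pitfall above cannot occur), join and split are inverses for space-separated words:

```python
words = ['Welcome', 'to', 'the', 'world', 'of', 'bees']
sentence = ' '.join(words)        # list -> string
print(sentence)                   # Welcome to the world of bees
print(sentence.split() == words)  # string -> the original list: True
```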
ph1 = "I don't know what I am looking for"
ph2 = "but surely I am looking for something!"
len(ph1)
34
len(ph2)
38
len(ph1 + ph2) # concatenate the two strings, then count the characters of the result
72
len(ph1) + len(ph2) # count each string separately, then add the lengths: 34 + 38 = 72 either way
72
print((ph1 + ', ') + ph2)
I don't know what I am looking for, but surely I am looking for something!
sent1[2][2]
'h'
sent1
['Call', 'me', 'Ishmael', '.']
sent1[1]
'me'
sent3
['In', 'the', 'beginning', 'God', 'created', 'the', 'heaven', 'and', 'the', 'earth', '.']
sent3[1]
'the'
sent3.index('the')
1
sent3[1]
'the'
sent3[6]
'heaven'
☼ Review the discussion of conditionals in 4. Find all words in the Chat Corpus (text5) starting with the letter b. Show them in alphabetical order.
sorted(w for w in set(text5) if w.startswith('b'))
['b', 'b-day', 'b/c', 'b4', 'babay', 'babble', 'babblein', 'babe', 'babes', 'babi', 'babies', 'babiess', 'baby', 'babycakeses', 'bachelorette', 'back', 'backatchya', 'backfrontsidewaysandallaroundtheworld', 'backroom', 'backup', 'bacl', 'bad', 'bag', 'bagel', 'bagels', 'bahahahaa', 'bak', 'baked', 'balad', 'balance', 'balck', 'ball', 'ballin', 'balls', 'ban', 'band', 'bandito', 'bandsaw', 'banjoes', 'banned', 'baord', 'bar', 'barbie', 'bare', 'barely', 'bares', 'barfights', 'barks', 'barn', 'barrel', 'base', 'bases', 'basically', 'basket', 'battery', 'bay', 'bbbbbyyyyyyyeeeeeeeee', 'bbiam', 'bbl', 'bbs', 'bc', 'be', 'beach', 'beachhhh', 'beam', 'beams', 'beanbag', 'beans', 'bear', 'bears', 'beat', 'beaten', 'beatles', 'beats', 'beattles', 'beautiful', 'because', 'beckley', 'become', 'bed', 'bedford', 'bedroom', 'beeeeehave', 'beeehave', 'been', 'beer', 'before', 'beg', 'begin', 'behave', 'behind', 'bein', 'being', 'beleive', 'believe', 'belive', 'bell', 'belly', 'belong', 'belongings', 'ben', 'bend', 'benz', 'bes', 'beside', 'besides', 'best', 'bet', 'betrayal', 'betta', 'better', 'between', 'beuty', 'bf', 'bi', 'biatch', 'bible', 'biebsa', 'bied', 'big', 'bigest', 'biggest', 'biiiatch', 'bike', 'bikes', 'bikini', 'bio', 'bird', 'birfday', 'birthday', 'bisexual', 'bishes', 'bit', 'bitch', 'bitches', 'bitdh', 'bite', 'bites', 'biyatch', 'biz', 'bj', 'black', 'blade', 'blah', 'blank', 'blankie', 'blazed', 'bleach', 'blech', 'bless', 'blessings', 'blew', 'blind', 'blinks', 'bliss', 'blocking', 'bloe', 'blood', 'blooded', 'bloody', 'blow', 'blowing', 'blowjob', 'blowup', 'blue', 'blueberry', 'bluer', 'blues', 'blunt', 'board', 'bob', 'bodies', 'body', 'boed', 'boght', 'boi', 'boing', 'boinked', 'bois', 'bomb', 'bone', 'boned', 'bones', 'bong', 'boning', 'bonus', 'boo', 'booboo', 'boobs', 'book', 'boom', 'boooooooooooglyyyyyy', 'boost', 'boot', 'bootay', 'booted', 'boots', 'booty', 'border', 'borderline', 'bored', 'boredom', 'boring', 'born', 'born-again', 'bosom', 
'boss', 'bossy', 'bot', 'both', 'bother', 'bothering', 'bottle', 'bought', 'bounced', 'bouncer', 'bouncers', 'bound', 'bout', 'bouts', 'bow', 'bowl', 'box', 'boy', 'boyfriend', 'boys', 'bra', 'brad', 'brady', 'brain', 'brakes', 'brass', 'brat', 'brb', 'brbbb', 'bread', 'break', 'breaks', 'breath', 'breathe', 'bred', 'breeding', 'bright', 'brightened', 'bring', 'brings', 'bro', 'broke', 'brooklyn', 'brother', 'brothers', 'brought', 'brown', 'brrrrrrr', 'bruises', 'brunswick', 'brwn', 'btw', 'bucks', 'buddyyyyyy', 'buff', 'buffalo', 'bug', 'bugs', 'buh', 'build', 'builds', 'built', 'bull', 'bulls', 'bum', 'bumber', 'bummer', 'bumped', 'bumper', 'bunch', 'bunny', 'burger', 'burito', 'burned', 'burns', 'burp', 'burpin', 'burps', 'burried', 'burryed', 'bus', 'buses', 'bust', 'busted', 'busy', 'but', 'butt', 'butter', 'butterscotch', 'button', 'buttons', 'buy', 'buying', 'bwahahahahahahahahahaha', 'by', 'byb', 'bye', 'byeee', 'byeeee', 'byeeeeeeee', 'byeeeeeeeeeeeee', 'byes']
☼ Type the expression list(range(10)) at the interpreter prompt. Now try list(range(10, 20)), list(range(10, 20, 2)), and list(range(20, 10, -2)). We will see a variety of uses for this built-in function in later chapters.
list(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
list(range(10, 20))
[10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
list(range(10, 20, 2))
[10, 12, 14, 16, 18]
list(range(20, 10, -2))
[20, 18, 16, 14, 12]
◑ Use text9.index() to find the index of the word sunset. You'll need to insert this word as an argument between the parentheses. By a process of trial and error, find the slice for the complete sentence that contains this word.
text9.index('sunset')
629
text9[621:644]
['THE', 'suburb', 'of', 'Saffron', 'Park', 'lay', 'on', 'the', 'sunset', 'side', 'of', 'London', ',', 'as', 'red', 'and', 'ragged', 'as', 'a', 'cloud', 'of', 'sunset', '.']
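The trial-and-error search for the sentence boundaries can also be automated. A sketch on a toy token list (the helper name `sentence_slice` is made up, and treating every '.' as a sentence end is a simplification):

```python
def sentence_slice(tokens, i):
    """Return (start, end) slice bounds of the sentence containing index i,
    treating '.' as the sentence terminator (a simplification)."""
    start = i
    while start > 0 and tokens[start - 1] != '.':
        start -= 1
    end = i
    while end < len(tokens) and tokens[end] != '.':
        end += 1
    return start, end + 1  # include the final '.'

toy = ['Hello', 'there', '.', 'The', 'sunset', 'was', 'red', '.', 'Bye', '.']
s, e = sentence_slice(toy, toy.index('sunset'))
print(toy[s:e])  # ['The', 'sunset', 'was', 'red', '.']
```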
◑ Using list addition, and the set and sorted operations, compute the vocabulary of the sentences sent1 ... sent8.
sentall = sent1 + sent2 + sent3 + sent4 + sent5 + sent6 + sent7 + sent8
sorted(set(sentall))
['!', ',', '-', '.', '1', '25', '29', '61', ':', 'ARTHUR', 'Alex', 'Call', 'Citizens', 'Fellow', 'God', 'House', 'I', 'In', 'Ishmael', 'JOIN', 'KING', 'MALE', 'Nov.', 'Oleg', 'PMing', 'Pierre', 'Representatives', 'SCENE', 'SEXY', 'Senate', 'Vinken', 'Whoa', '[', ']', 'a', 'and', 'as', 'attrac', 'beginning', 'board', 'clop', 'created', 'director', 'discreet', 'earth', 'encounters', 'for', 'have', 'heaven', 'join', 'lady', 'lol', 'me', 'nonexecutive', 'of', 'old', 'older', 'people', 'problem', 'seeks', 'single', 'the', 'there', 'to', 'will', 'wind', 'with', 'years']
◑ What is the difference between the following two lines? Which one will give a larger value? Will this be the case for other texts?
#sorted(set(w.lower() for w in text1)) # lowercase first, then deduplicate: each word appears once
#sorted(w.lower() for w in set(text1)) # deduplicate first, then lowercase: 'This' and 'this' both survive, so this list is never smaller
#set(w.lower() for w in text1)
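A toy token list (invented for illustration) makes the difference concrete:

```python
tokens = ['This', 'this', 'THIS', 'cat']
a = sorted(set(w.lower() for w in tokens))  # lowercase, then deduplicate
b = sorted(w.lower() for w in set(tokens))  # deduplicate, then lowercase
print(a)  # ['cat', 'this']
print(b)  # ['cat', 'this', 'this', 'this']
print(len(b) >= len(a))  # the second form is never smaller: True
```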
What is the difference between the following two tests: w.isupper() and not w.islower()?
sorted(set(w.upper() for w in sent1))
['.', 'CALL', 'ISHMAEL', 'ME']
sorted(set(w.lower() for w in sent1))
['.', 'call', 'ishmael', 'me']
set(w.isupper() for w in sent1)
{False}
set(not w.islower() for w in sent1)
{False, True}
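Probing a few token shapes (toy tokens) shows where the two tests disagree:

```python
for w in ['ABC', 'abc', 'Abc', '123', '.']:
    print(w, w.isupper(), not w.islower())
# 'ABC' passes both tests; 'Abc', '123', and '.' pass only `not w.islower()`:
# isupper() demands that all cased characters be uppercase (and that there is
# at least one), while islower() is False for any token that is not entirely
# lowercase, including tokens with no cased characters at all
```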
sent1
['Call', 'me', 'Ishmael', '.']
◑ Write the slice expression that extracts the last two words of text2.
text2[-2:] # a slice expression taking the last two words
['THE', 'END']