NLP
Q&A
Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of men, women, and people in each document. What has happened to the usage of these words over time?
```python
from collections import Counter
import nltk
import matplotlib.pyplot as plt

out = {'men': {}, 'women': {}, 'people': {}}
for ids in nltk.corpus.state_union.fileids():
    counter = Counter(nltk.corpus.state_union.words(fileids=ids))
    out['men'][ids[:4]] = counter['men']        # ids[:4] is the year of the address
    out['women'][ids[:4]] = counter['women']
    out['people'][ids[:4]] = counter['people']

plt.figure(figsize=(20, 6))
plt.plot(list(out['men'].keys()), list(out['men'].values()), label='men')
plt.plot(list(out['women'].keys()), list(out['women'].values()), label='women')
plt.plot(list(out['people'].keys()), list(out['people'].values()), label='people')
plt.xticks(rotation=90)
plt.legend()
plt.show()
```

Investigate the holonym-meronym relations for some nouns. Remember that there are three kinds of holonym-meronym relation, so you need to use: member_meronyms(), part_meronyms(), substance_meronyms(), member_holonyms(), part_holonyms(), and substance_holonyms().
- **Member Meronym**: A member of a group or collection. For example, 'tree' is a member of 'forest'.
- **Part Meronym**: A part or component of something. For example, 'wheel' is a part of 'car'.
- **Substance Meronym**: A substance or material that makes up something. For example, 'steel' is a substance of 'bridge'.
- **Member Holonym**: The group or collection to which something belongs. For example, 'forest' is the group to which 'tree' belongs.
- **Part Holonym**: The whole entity that includes a part. For example, 'car' is the whole entity that includes 'wheel'.
- **Substance Holonym**: The whole entity that is made up of a substance. For example, 'bridge' is the whole entity made up of 'steel'.
```python
from nltk.corpus import wordnet as wn

def print_meronym_holonym_relations(word):
    synsets = wn.synsets(word)
    if not synsets:
        print(f"No synsets found for '{word}'")
        return
    synset = synsets[0]  # Consider the first synset for simplicity
    print(f"\nInvestigating '{word}' - {synset.definition()}")

    # Member-Meronyms
    print("\nMember Meronyms:")
    for meronym in synset.member_meronyms():
        print(f" - {meronym.name()}: {meronym.definition()}")

    # Part-Meronyms
    print("\nPart Meronyms:")
    for meronym in synset.part_meronyms():
        print(f" - {meronym.name()}: {meronym.definition()}")

    # Substance-Meronyms
    print("\nSubstance Meronyms:")
    for meronym in synset.substance_meronyms():
        print(f" - {meronym.name()}: {meronym.definition()}")

    # Member-Holonyms
    print("\nMember Holonyms:")
    for holonym in synset.member_holonyms():
        print(f" - {holonym.name()}: {holonym.definition()}")

    # Part-Holonyms
    print("\nPart Holonyms:")
    for holonym in synset.part_holonyms():
        print(f" - {holonym.name()}: {holonym.definition()}")

    # Substance-Holonyms
    print("\nSubstance Holonyms:")
    for holonym in synset.substance_holonyms():
        print(f" - {holonym.name()}: {holonym.definition()}")

# Investigate the relations for selected nouns
words_to_investigate = ['tree', 'car', 'water']
for word in words_to_investigate:
    print_meronym_holonym_relations(word)
```

Zipf's Law: Let f(w) be the frequency of a word w in free text. Suppose that all the words of a text are ranked according to their frequency, with the most frequent word first. Zipf's law states that the frequency of a word type is inversely proportional to its rank (i.e. f × r = k, for some constant k). For example, the 50th most common word type should occur three times as frequently as the 150th most common word type.

Write a function to process a large text and plot word frequency against word rank using pylab.plot. Do you confirm Zipf's law? (Hint: it helps to use a logarithmic scale.) What is going on at the extreme ends of the plotted line?

Generate random text, e.g., using random.choice("abcdefg "), taking care to include the space character. You will need to import random first. Use the string concatenation operator to accumulate characters into a (very) long string. Then tokenize this string, generate the Zipf plot as before, and compare the two plots. What do you make of Zipf's Law in the light of this?
```python
import nltk
import matplotlib.pyplot as plt
from collections import Counter
from nltk.corpus import gutenberg

words = nltk.word_tokenize(gutenberg.raw('melville-moby_dick.txt'))

# Calculate word frequencies
word_freqs = Counter(words)

# Sort words by frequency
sorted_word_freqs = sorted(word_freqs.items(), key=lambda x: x[1], reverse=True)

# Extract ranks and frequencies
ranks = range(1, len(sorted_word_freqs) + 1)
frequencies = [freq for word, freq in sorted_word_freqs]

# Plot word frequency against rank
plt.figure(figsize=(12, 6))
plt.loglog(ranks, frequencies, marker=".")
plt.title("Word Frequency vs Rank in Moby Dick")
plt.xlabel("Rank")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()

# Confirming Zipf's Law: f × r = k
# Check the constant k for the first few words
k_values = [freq * rank for freq, rank in zip(frequencies[:10], ranks[:10])]
print("k values for the first 10 words:", k_values)
print("Mean k value:", sum(k_values) / len(k_values))
```

```python
import random

# Generate random text
random_text = ''.join(random.choices("abcdefg ", k=1000000))

# Tokenize the random text
random_words = nltk.word_tokenize(random_text)

# Calculate word frequencies
random_word_freqs = Counter(random_words)

# Sort words by frequency
sorted_random_word_freqs = sorted(random_word_freqs.items(), key=lambda x: x[1], reverse=True)

# Extract ranks and frequencies
random_ranks = range(1, len(sorted_random_word_freqs) + 1)
random_frequencies = [freq for word, freq in sorted_random_word_freqs]

# Plot word frequency against rank for random text
plt.figure(figsize=(12, 6))
plt.loglog(random_ranks, random_frequencies, marker=".")
plt.title("Word Frequency vs Rank in Random Text")
plt.xlabel("Rank")
plt.ylabel("Frequency")
plt.grid(True)
plt.show()
```

The next two blocks classify names by gender using letter-based features: first with NLTK's built-in NaiveBayesClassifier, then with a hand-rolled Naive Bayes over the same features.

```python
import random
import string
import nltk
from nltk.corpus import names
from nltk.classify import apply_features

def gender_features(name):
    features = {}
    features["first_letter"] = name[0]
    features["last_letter"] = name[-1]
    for letter in string.ascii_lowercase:
        features[f"count({letter})"] = name.lower().count(letter)
        features[f"has({letter})"] = (letter in name.lower())
    return features

names_ = ([(male, 'male') for male in names.words('male.txt')] +
          [(female, 'female') for female in names.words('female.txt')])
random.shuffle(names_)

# Hold out the first 500 names for testing
train_set = apply_features(gender_features, names_[500:])
test_set = apply_features(gender_features, names_[:500])

classifier = nltk.NaiveBayesClassifier.train(train_set)
classifier.classify(gender_features('o'))  # classify a sample input
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features()
```

```python
import random
import string
from collections import defaultdict, Counter
from nltk.corpus import names

# Function to extract gender features from a name
def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in string.ascii_lowercase:
        features[f"count({letter})"] = name.lower().count(letter)
        features[f"has({letter})"] = (letter in name.lower())
    return features

# Load and shuffle the names dataset
names_ = ([(name, 'male') for name in names.words('male.txt')] +
          [(name, 'female') for name in names.words('female.txt')])
random.shuffle(names_)

# Split the data into training and testing sets
train_names = names_[500:]
test_names = names_[:500]

# Extract features for the training and testing sets
train_set = [(gender_features(name), gender) for (name, gender) in train_names]
test_set = [(gender_features(name), gender) for (name, gender) in test_names]

# Function to train the Naive Bayes classifier
def train_naive_bayes(train_set):
    feature_freq = defaultdict(lambda: defaultdict(Counter))
    label_freq = Counter()
    total_samples = len(train_set)
    for features, label in train_set:
        label_freq[label] += 1
        for feature, value in features.items():
            feature_freq[feature][value][label] += 1
    return feature_freq, label_freq, total_samples

# Train the Naive Bayes classifier
feature_freq, label_freq, total_samples = train_naive_bayes(train_set)

# Function to calculate the probability of a label given features
def predict(features):
    label_probs = {}
    for label in label_freq:
        label_prob = label_freq[label] / total_samples
        for feature, value in features.items():
            if feature in feature_freq and value in feature_freq[feature]:
                label_prob *= (feature_freq[feature][value][label] + 1) / (label_freq[label] + len(feature_freq[feature]))
            else:
                label_prob *= 1 / (label_freq[label] + len(feature_freq[feature]))
        label_probs[label] = label_prob
    return max(label_probs, key=label_probs.get)

# Evaluate the classifier
correct = 0
for features, actual_label in test_set:
    predicted_label = predict(features)
    if predicted_label == actual_label:
        correct += 1

accuracy = correct / len(test_set)
print(f'Accuracy: {accuracy:.4f}')
```
Notes
nltk
Lexical Resources
A lexicon, or lexical resource, is a collection of words and/or phrases along with associated information such as part of speech and sense definitions. Lexical resources are secondary to texts, and are usually created and enriched with the help of texts. For example, if we have defined a text my_text, then vocab = sorted(set(my_text)) builds the vocabulary of my_text, while word_freq = FreqDist(my_text) counts the frequency of each word in the text. Both vocab and word_freq are simple lexical resources. Similarly, a concordance like the one we saw in 1.1 gives us information about word usage that might help in the preparation of a dictionary. Standard terminology for lexicons is illustrated in 2.7. A lexical entry consists of a headword (also known as a lemma) along with additional information such as the part of speech and the sense definition. Two distinct words having the same spelling are called homonyms.
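For concreteness, a minimal sketch of the two "lexical resources" mentioned above, assuming NLTK and its gutenberg corpus are available (the choice of austen-emma.txt is purely illustrative):

```python
import nltk
from nltk import FreqDist
from nltk.corpus import gutenberg

# Any tokenized text will do; Emma is used here only as an example
my_text = gutenberg.words('austen-emma.txt')

# The vocabulary: every distinct word type, sorted
vocab = sorted(set(my_text))

# The frequency distribution: how often each word type occurs
word_freq = FreqDist(my_text)

print(len(vocab))                  # size of the vocabulary
print(word_freq.most_common(10))   # the ten most frequent word types
```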
WordNet
WordNet is a semantically-oriented dictionary of English, similar to a traditional thesaurus but with a richer structure. NLTK includes the English WordNet, with 155,287 words and 117,659 synonym sets.
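A quick sketch of how that structure is exposed through nltk.corpus.wordnet; the word 'motorcar' is just a convenient example (it happens to map to a single synset):

```python
from nltk.corpus import wordnet as wn

# A word maps to one or more synsets (synonym sets)
print(wn.synsets('motorcar'))   # e.g. [Synset('car.n.01')]

# Each synset bundles lemmas, a definition, and example sentences
car = wn.synset('car.n.01')
print(car.lemma_names())        # ['car', 'auto', 'automobile', ...]
print(car.definition())
print(car.examples())
```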