Regular Expression
Python Regular Expression Cheat Sheet
Basics
.: Matches any character except a newline.
^: Anchors the regex at the start of the string.
$: Anchors the regex at the end of the string.
[]: Matches any one of the enclosed characters.
|: Acts like a logical OR.
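The basic symbols above can be tried out in a few lines. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```python
import re

# '.' matches any single character except a newline, so "c\nt" is skipped
dot_match = re.search(r"c.t", "cat cot c\nt").group()

# '^' and '$' tie the pattern to the start and end of the string
starts_with_hello = bool(re.search(r"^Hello", "Hello World"))
ends_with_world = bool(re.search(r"World$", "Hello World"))

# '[]' matches any one of the enclosed characters
vowels = re.findall(r"[aeiou]", "regex")

# '|' matches either alternative
pets = re.findall(r"cat|dog", "cat, dog, bird")

print(dot_match)          # cat
print(starts_with_hello)  # True
print(ends_with_world)    # True
print(vowels)             # ['e', 'e']
print(pets)               # ['cat', 'dog']
```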
Character Classes
\d: Matches any digit (0-9).
\D: Matches any non-digit.
\w: Matches any word character (letters, digits, and the underscore).
\W: Matches any non-word character.
\s: Matches any whitespace character.
\S: Matches any non-whitespace character.
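A short sketch of the character classes in action. The sample text is invented; note that `7_pears` stays in one piece because `\w` also matches the underscore.

```python
import re

text = "Order 42: 3 apples, 7_pears\t(ok)"

digits = re.findall(r"\d", text)          # every single digit
words = re.findall(r"\w+", text)          # runs of word characters
whitespace = re.findall(r"\s", text)      # spaces and the tab
digits_only = re.sub(r"\D", "", text)     # strip every non-digit

print(digits)       # ['4', '2', '3', '7']
print(words)        # ['Order', '42', '3', 'apples', '7_pears', 'ok']
print(whitespace)   # [' ', ' ', ' ', ' ', '\t']
print(digits_only)  # 4237
```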
Quantifiers
*: Matches 0 or more occurrences.
+: Matches 1 or more occurrences.
?: Matches 0 or 1 occurrence.
{n}: Matches exactly n occurrences.
{n,}: Matches n or more occurrences.
{n,m}: Matches between n and m occurrences.
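The quantifiers can be compared side by side on the same input. This is an illustrative sketch with made-up strings; the `\b` word boundaries keep the `{n}` forms from matching inside longer numbers.

```python
import re

letters = "a ab abbb"
numbers = "7 42 123 4567"

zero_or_more = re.findall(r"ab*", letters)           # 'a' followed by any number of 'b'
one_or_more = re.findall(r"ab+", letters)            # 'a' followed by at least one 'b'
optional_u = re.findall(r"colou?r", "color colour")  # the 'u' may be absent
exactly_three = re.findall(r"\b\d{3}\b", numbers)
two_or_more = re.findall(r"\b\d{2,}\b", numbers)
two_to_three = re.findall(r"\b\d{2,3}\b", numbers)

print(zero_or_more)   # ['a', 'ab', 'abbb']
print(one_or_more)    # ['ab', 'abbb']
print(optional_u)     # ['color', 'colour']
print(exactly_three)  # ['123']
print(two_or_more)    # ['42', '123', '4567']
print(two_to_three)   # ['42', '123']
```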
Anchors
\b: Matches a word boundary.
\B: Matches a non-word boundary.
^: Matches the start of a string.
$: Matches the end of a string.
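Anchors match positions rather than characters, which is easiest to see by searching for the same letters in different places. A small sketch with invented sample text:

```python
import re

text = "cat concat catalog"

# \b on both sides: only the standalone word
whole_word = re.findall(r"\bcat\b", text)

# \b on the left only: words that start with 'cat'
starts_with_cat = re.findall(r"\bcat\w*", text)

# \B on the left: 'cat' that is NOT at the start of a word (inside 'concat')
inner_start = re.search(r"\Bcat", text).start()

# Without flags, ^ and $ refer to the whole string, not to each line
lines = "first line\nsecond line"
first_word = re.findall(r"^\w+", lines)
last_word = re.findall(r"\w+$", lines)

print(whole_word)       # ['cat']
print(starts_with_cat)  # ['cat', 'catalog']
print(inner_start)      # 7
print(first_word)       # ['first']
print(last_word)        # ['line']
```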
Groups and Capturing
(...): Groups patterns together and captures the matched text.
(?:...): Non-capturing group; groups patterns without capturing the match.
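The difference between the two kinds of group shows up most clearly in `re.findall`, which returns the captured group rather than the whole match when the pattern contains one. A minimal sketch with made-up input:

```python
import re

# Capturing groups expose the pieces of a match
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "Released on 2024-03-15.")
full_date = m.group(0)
date_parts = m.groups()

# With a capturing group, findall returns the last captured repetition
captured = re.findall(r"(ab)+", "ababab ab")

# With a non-capturing group, findall returns the whole match
whole = re.findall(r"(?:ab)+", "ababab ab")

print(full_date)   # 2024-03-15
print(date_parts)  # ('2024', '03', '15')
print(captured)    # ['ab', 'ab']
print(whole)       # ['ababab', 'ab']
```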
Character Escapes
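\: Escapes a special character (for example, \. matches a literal dot and \\ matches a literal backslash).
\n, \t, etc.: Match a newline, a tab, and other control characters.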
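Escaping matters as soon as the text to match contains regex metacharacters. The sketch below uses raw strings (`r"..."`) so that backslashes reach the regex engine unchanged, and `re.escape` to escape a whole literal at once; the sample strings are made up.

```python
import re

# Unescaped '.' matches any character; escaped '\.' matches only a dot
loose = re.findall(r"3.14", "3.14 3x14")
strict = re.findall(r"3\.14", "3.14 3x14")

# re.escape() escapes every metacharacter in a literal string
literal = "1+1=2?"
found = re.search(re.escape(literal), "Is 1+1=2? Yes.").group()

# \t and \n inside a character class split on tabs and newlines
fields = re.split(r"[\t\n]", "a\tb\nc")

# A literal backslash is written as \\ in the pattern
backslash_count = len(re.findall(r"\\", r"C:\temp\new"))

print(loose)            # ['3.14', '3x14']
print(strict)           # ['3.14']
print(found)            # 1+1=2?
print(fields)           # ['a', 'b', 'c']
print(backslash_count)  # 2
```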
Examples
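\d{3}-\d{2}-\d{4}: Matches a U.S. Social Security number written as 123-45-6789.
^\w+@\w+\.\w+$: Matches a basic email address such as user@example.com (no dots or hyphens in the name, and a single dot in the domain).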
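Both example patterns are deliberately simple, and it helps to see where they stop working. The sketch below is illustrative only: the constant names and test strings are made up, and neither pattern is a real validator.

```python
import re

SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")
EMAIL_PATTERN = re.compile(r"^\w+@\w+\.\w+$")

ssn = SSN_PATTERN.search("SSN: 123-45-6789").group()

# Without anchors or \b, the pattern also matches inside a longer digit run
partial = SSN_PATTERN.search("1234-56-78901").group()
bounded = re.search(r"\b\d{3}-\d{2}-\d{4}\b", "1234-56-78901")

simple_ok = bool(EMAIL_PATTERN.search("user@example.com"))
dotted_name = bool(EMAIL_PATTERN.search("first.last@example.com"))  # \w does not match '.'
subdomain = bool(EMAIL_PATTERN.search("user@mail.example.com"))     # only one dot allowed

print(ssn)          # 123-45-6789
print(partial)      # 234-56-7890
print(bounded)      # None
print(simple_ok)    # True
print(dotted_name)  # False
print(subdomain)    # False
```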
Flags
re.IGNORECASE or re.I: Case-insensitive matching.
re.MULTILINE or re.M: ^ and $ match the start/end of each line.
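Flags are passed through the `flags` argument and can be combined with `|`. A small sketch with invented sample text:

```python
import re

# re.IGNORECASE: letter case is ignored
any_case = re.findall(r"python", "Python PYTHON python", flags=re.IGNORECASE)

text = "first line\nsecond line\nthird line"

# Without re.MULTILINE, ^ matches only at the very start of the string
first_only = re.findall(r"^\w+", text)

# With re.MULTILINE, ^ matches at the start of every line
each_line = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Flags combine with the bitwise OR operator
combined = re.findall(r"^\w+ LINE$", text, flags=re.I | re.M)

print(any_case)    # ['Python', 'PYTHON', 'python']
print(first_only)  # ['first']
print(each_line)   # ['first', 'second', 'third']
print(combined)    # ['first line', 'second line', 'third line']
```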
Methods
re.search(pattern, string): Searches for the first occurrence of the pattern.
re.match(pattern, string): Matches the pattern only at the beginning of the string.
re.fullmatch(pattern, string): Matches the entire string against the pattern.
re.findall(pattern, string): Returns a list of all occurrences of the pattern.
re.finditer(pattern, string): Returns an iterator of match objects for all occurrences.
re.sub(pattern, replacement, string): Replaces occurrences of the pattern with the replacement.
import re

# Example 1: Check if a string contains a digit
result = bool(re.search(r'\d', 'Hello123World'))
print(result)  # Output: True

# Example 2: Extract all words from a string
words = re.findall(r'\b\w+\b', 'This is a sample sentence.')
print(words)  # Output: ['This', 'is', 'a', 'sample', 'sentence']

# Example 3: Replace all lowercase vowels with '*'
new_string = re.sub(r'[aeiou]', '*', 'Hello World')
print(new_string)  # Output: H*ll* W*rld

# Example 4: Extract the domain part of an email address (the match includes the leading '@')
domain = re.search(r'@\w+\.\w+', 'user@example.com').group()
print(domain)  # Output: @example.com
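The examples above use `re.search`, `re.findall`, and `re.sub`. As a complement, here is a sketch of how `re.match`, `re.fullmatch`, and `re.finditer` behave on the same input; the sample string is made up.

```python
import re

text = "abc123def456"

# search finds the first match anywhere in the string
first_number = re.search(r"\d+", text).group()

# match only succeeds at the beginning of the string
match_digits = re.match(r"\d+", text)            # None: the string starts with letters
match_letters = re.match(r"[a-z]+", text).group()

# fullmatch requires the whole string to fit the pattern
full_letters = re.fullmatch(r"[a-z]+", text)     # None: digits are left over
full_alnum = re.fullmatch(r"[a-z0-9]+", text).group()

# finditer yields match objects, which carry positions as well as text
spans = [(m.group(), m.start(), m.end()) for m in re.finditer(r"\d+", text)]

print(first_number)   # 123
print(match_digits)   # None
print(match_letters)  # abc
print(full_letters)   # None
print(full_alnum)     # abc123def456
print(spans)          # [('123', 3, 6), ('456', 9, 12)]
```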
Split a paragraph into sentences
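paragraph = input()
# Split on periods and drop empty pieces, such as the one after the final period
sentences = [s.strip() for s in paragraph.split('.') if s.strip()]
for sentence in sentences:
    print(f"{sentence}.")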
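Since this post is about regular expressions, the same task can also be sketched with `re.split`. The version below is one possible approach, not a full sentence tokenizer: the helper name `split_sentences` is hypothetical, and it assumes a sentence ends with `.`, `!`, or `?` followed by whitespace, so abbreviations such as "Dr. Smith" would be split incorrectly.

```python
import re

def split_sentences(paragraph):
    """Split text after '.', '!' or '?' when whitespace follows."""
    # The lookbehind keeps the punctuation attached to its sentence
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

sentences = split_sentences("Hello there. How are you? Fine!")
print(sentences)  # ['Hello there.', 'How are you?', 'Fine!']
```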
Perplexity Calculation
Perplexity is a measure of how well a probability distribution or probability model predicts a sample. It is often used in the context of language modeling.
The formula for perplexity in the case of a unigram model is:
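\[\text{Perplexity} = 2^{H(p)}\]
\[H(p) = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)\]
where $H(p)$ is the entropy, in bits, of the unigram distribution $p$ over the vocabulary $\mathcal{X}$. When the model is scored on held-out text, $H$ becomes the cross-entropy between the text's empirical distribution and the model.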
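The formula can be sketched in a few lines of Python. This is a minimal illustration, assuming that $p(x)$ is estimated as the relative frequency of each token in the same token list being scored; the function name `unigram_perplexity` is hypothetical.

```python
import math
from collections import Counter

def unigram_perplexity(tokens):
    """Return 2 ** H(p), with p(x) estimated from token frequencies."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # H(p) = -sum over x of p(x) * log2 p(x)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2 ** entropy

# Four equally likely tokens: H(p) = 2 bits, so the perplexity is 4
uniform = unigram_perplexity(["a", "b", "c", "d"])
print(uniform)  # 4.0
```

A uniform distribution over k distinct tokens always gives a perplexity of k, which makes perplexity readable as an effective number of equally likely choices.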
This post is licensed under CC BY 4.0 by the author.