Python Character and Byte String Handling

Python has immutable strings of Unicode code points, str, and of 8-bit bytes, bytes, both of which are sequences with further specialized methods. There's no separate char type: indexing a str, s[0], produces a str of length 1, while indexing a bytes, b[0], produces an int in range(256); use a slice, b[0:1], to get a bytes of length 1.
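
A minimal sketch of how indexing behaves on the two types; the sample string is an arbitrary choice for illustration:

```python
s = "héllo"
b = s.encode("utf-8")

first_char = s[0]       # str of length 1
first_byte = b[0]       # int in range(256), not a bytes object
first_slice = b[0:1]    # bytes of length 1

# len() counts code points for str and bytes for bytes
lengths = (len(s), len(b))
print(repr(first_char), first_byte, first_slice, lengths)
```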

Other binary sequence types include:

  • bytearray: Mutable counterpart to bytes. No string literal constructor but otherwise all the same methods plus mutators.
  • memoryview: Memory buffers to access internal data of objects supporting the (C-level) buffer protocol.
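
As a rough illustration of the two types above, the following sketch mutates a bytearray in place and then writes through a memoryview slice; the variable names are made up for the example:

```python
buf = bytearray(b"hello world")
buf[0] = ord("H")        # item assignment works, unlike on bytes
buf.extend(b"!")         # mutator that bytes does not have

view = memoryview(buf)
window = view[6:11]      # slice shares memory with buf, no copy made
window[0] = ord("W")     # write goes through to the underlying bytearray

window_copy = bytes(window)
result = bytes(buf)
print(result)
```

Note that a bytearray cannot be resized while a memoryview of it is alive, which is why the extend call comes first here.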

Constructors

  • str(obj='')
  • str(obj=b'', encoding='utf-8', errors='strict')
  • bytes(10): Zero-filled bytes of length 10.
  • bytes(range(20)): From iterable of integers 0 ≤ i < 256.
  • bytes(b'abc'): Copy of binary data via buffer protocol
  • bytes.fromhex('2Ef0 F1F2'): ASCII hex representation, skipping whitespace
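
The constructors above can be tried out as follows. This is only a sketch with arbitrary sample values:

```python
zeros = bytes(4)                                  # four zero bytes
from_ints = bytes(range(65, 70))                  # integers 65..69
from_hex = bytes.fromhex("2Ef0 F1F2")             # whitespace is skipped
decoded = str(b"caf\xc3\xa9", encoding="utf-8")   # decode bytes to str
as_repr = str(b"abc")                             # no encoding given: repr-style text

print(zeros, from_ints, from_hex, decoded, as_repr)
```

Without an encoding argument, str() on a bytes object does not decode; it returns the printable representation, which is rarely what is wanted.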

Literals are quoted with single (') or double (") quotes; each allows the other in its string. Triple-quoted strings (''' or """) may span multiple lines. Adjacent string literals are concatenated into a single string.

String literals may be prefixed with case-insensitive one-character prefixes to change their interpretation:

  • b: Produce a bytes instead of a str. Only ASCII chars (codepoints < 128) and backslash escape sequences allowed.
  • r: Raw string or bytes; backslashes are interpreted literally. (Not usable with u.)
  • u: Unicode literal. Does nothing in Python ≥3.3; in Python 2, where str is the equivalent of bytes, reads string literal as Unicode instead.
  • f: (≥3.6) Formatted string literal. Cannot be combined with b or u.

More, including escape code list, at String and Bytes literals.
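
A short sketch of the quoting rules and prefixes described above (sample values are arbitrary):

```python
both_quotes = "it's" ' a "test"'     # adjacent literals are concatenated
raw = r"C:\new\table"                # backslashes kept literally
cooked = "C:\\new\\table"            # same text written with escapes
multi = """line one
line two"""                          # triple quotes may span lines
name = "world"
formatted = f"hello {name!r}"        # f-string with repr conversion
raw_bytes = Rb"\x00"                 # prefixes are case-insensitive and combine

print(both_quotes, raw, formatted, raw_bytes)
```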

Methods

All methods below apply to both character and byte strings (str and bytes) unless otherwise indicated. Methods that assume chars (e.g., capitalize) assume ASCII in bytestrings. Methods available on immutable objects always return a new copy, even when called on a mutable object (e.g., bytearray.replace()).

Common Sequence Operations:

  • t [not] in s: Subsequence test, e.g., 'bar' in 'foobarbaz' is True
  • s + t: Concatenation returning new object. For better efficiency when joining many pieces, use ''.join([s, t, ...]) or write to io.StringIO.
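
One way to sketch the alternatives to repeated concatenation, assuming the pieces are already available as a list:

```python
import io

parts = ["alpha", "beta", "gamma"]

joined = "".join(parts)       # join takes a single iterable
csv = ",".join(parts)

buf = io.StringIO()           # in-memory text buffer
for p in parts:
    buf.write(p)
streamed = buf.getvalue()

contains = "bet" in csv       # subsequence test
print(joined, csv, streamed, contains)
```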

Encoding:

  • decode(encoding='utf-8', errors='strict'): Returns str decoded from bytes read as encoding. errors may be strict (raises UnicodeError), ignore, replace, etc.; see codec error handlers.
  • encode(encoding='utf-8', errors='strict'): Return bytes object encoded from str.
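
The effect of the errors argument can be seen in a small sketch like this one; the sample text and the choice of ASCII as a lossy target are just for illustration:

```python
text = "naïve café"

utf8 = text.encode("utf-8")
ascii_replace = text.encode("ascii", errors="replace")   # unencodable chars become ?
ascii_ignore = text.encode("ascii", errors="ignore")     # unencodable chars dropped

try:
    text.encode("ascii")          # errors='strict' is the default
    strict_failed = False
except UnicodeError:
    strict_failed = True

bad = b"caf\xe9"                  # Latin-1 bytes, not valid UTF-8
lenient = bad.decode("utf-8", errors="replace")           # U+FFFD for bad input

print(utf8, ascii_replace, ascii_ignore, strict_failed, lenient)
```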

Character Class Predicates (all chars must match and len ≥ 1; isprintable, isdecimal, isnumeric and isidentifier are str only):

  • isprintable(): Includes space but not other whitespace; true if empty as well
  • isspace(): Whitespace
  • isalnum(): Is alpha, decimal, digit or numeric
  • isalpha(): Unicode 'Letter' (not Unicode 'Alphabetic')
  • isdecimal(): Chars form numbers in base 10
  • isnumeric(): Includes e.g., fractions
  • isdigit(): Includes non-decimal, e.g., superscripts, Kharosthi numbers
  • istitle(): Cased chars after uncased chars upper, all other cased chars lower; must include a cased character
  • isupper(), islower(): Must include a cased character
  • isidentifier(): According to Python language def; also see keyword.iskeyword()
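
The differences among the three numeric predicates are easiest to see side by side. The characters below are picked as examples; the tuple order (isdecimal, isdigit, isnumeric) is a convention of this sketch only:

```python
import keyword

samples = {
    "5": "ASCII digit",
    "\u0665": "Arabic-Indic digit five",
    "\u00b2": "superscript two",
    "\u00bd": "vulgar fraction one half",
}

# Map each char to (isdecimal, isdigit, isnumeric)
table = {ch: (ch.isdecimal(), ch.isdigit(), ch.isnumeric()) for ch in samples}

empty_alpha = "".isalpha()            # empty string fails most predicates
empty_printable = "".isprintable()    # but isprintable accepts it
titled = "Hello World".istitle()
# A valid identifier may still be a reserved keyword
usable_name = "class".isidentifier() and not keyword.iskeyword("class")

for ch, desc in samples.items():
    print(desc, table[ch])
```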

String Predicates (the methods take optional start and end indexes):

  • s₁ in s₂: Operator; takes no start/end indexes
  • startswith(s), endswith(s)
  • count(s): Count of non-overlapping s
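
A brief sketch of these tests with start and end indexes. The tuple form of startswith is not mentioned above but is part of the standard method:

```python
s = "abababab"

non_overlapping = s.count("aba")        # matches at 0 and 4 only
windowed = s.count("ab", 2, 6)          # counts within s[2:6]
starts = s.startswith("ba", 1)          # test begins at index 1
either = s.startswith(("cd", "ab"))     # a tuple of candidates is accepted
ends = s.endswith("ab")

print(non_overlapping, windowed, starts, either, ends)
```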

Indexing (all take optional start and end indexes):

  • find(s), rfind(s): Returns lowest/highest index of s, or -1 if not found
  • index(), rindex(): As find but raise ValueError when not found
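
The contrast between the two families, sketched with an arbitrary sample string:

```python
s = "foobarbazbar"

first = s.find("bar")         # lowest index
last = s.rfind("bar")         # highest index
bounded = s.find("bar", 4)    # search starts at index 4
missing = s.find("qux")       # -1 rather than an exception

try:
    s.index("qux")
    index_raised = False
except ValueError:
    index_raised = True

print(first, last, bounded, missing, index_raised)
```

Because -1 is also a valid index, the result of find should be compared against -1 before it is used for slicing.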

Modification:

  • lstrip(cs), rstrip(cs), strip(cs): Remove leading/trailing/both chars of set made from string cs, default whitespace
  • replace(old, new[, count]): Replace substring old
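
Since the strip family treats its argument as a set of characters rather than a prefix or suffix, a small sketch may help (sample strings are arbitrary):

```python
whitespace = "  padded \n".strip()              # default strips whitespace
both_ends = "xxhixyx".strip("xy")               # any of x or y, from both ends
left_only = "www.example.com".lstrip("w.")
# Character set, not suffix: also eats the trailing 'e'-free run ".com"
surprising = "www.example.com".rstrip(".com")
limited = "a-b-c-d".replace("-", "+", 2)        # only first two replaced

print(whitespace, both_ends, left_only, surprising, limited)
```

On Python ≥3.9, removeprefix() and removesuffix() are available when an exact prefix or suffix should be removed instead.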

Case modification:

  • upper(), lower()
  • swapcase(): Not necessarily reversible
  • capitalize(): First char capitalized; rest lowered
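  • title(): First cased char after each uncased char uppered, rest lowered; apostrophes count as word boundaries, e.g., "they're".title() is "They'Re"
  • casefold(): (≥3.3, str only) More aggressive "lower casing" for caseless matching, as per section 3.13 of the Unicode Standard.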
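
A few of the edge cases in this group, sketched with the German sharp s as the example character:

```python
capitalized = "hELLO wORLD".capitalize()
titled = "they're bill's".title()        # apostrophes start a new "word"
upper_eszett = "straße".upper()          # one char may expand to two
folded = "Straße".casefold()
lowered = "Straße".lower()               # lower() leaves the sharp s alone

# swapcase() round trip does not restore the original
round_trip = "ß".swapcase().swapcase()

print(capitalized, titled, upper_eszett, folded, lowered, round_trip)
```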

Padding:

  • expandtabs(tabsize=8): Tabs replaced by spaces up to the next multiple of tabsize; column 0 is at start of string and after each \n or \r
  • center(width, fillchar=' ')
  • ljust(width, fillchar=' ')
  • rjust(width, fillchar=' ')
  • zfill(width): Pad with 0 between sign and digits; sign included in width
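
The padding methods above, sketched with short inputs so the widths are easy to count:

```python
centered = "ab".center(6, "*")
left = "ab".ljust(5, ".")
right = "ab".rjust(5)                    # default fill is a space
zero_filled = "-42".zfill(6)             # sign counts toward the width
tabs = "a\tbc\td".expandtabs(4)          # tab stops at columns 4 and 8

print(repr(centered), repr(left), repr(right), repr(zero_filled), repr(tabs))
```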

Splitting:

  • partition(sep), rpartition(sep):
    Return a 3-tuple of (pre, sep, post); if sep is not found, partition returns (str, '', '') and rpartition returns ('', '', str)
  • split(sep=None, maxsplit=-1), rsplit(sep=None, maxsplit=-1):
    • sep=None separates on runs of consecutive whitespace; leading/trailing whitespace is removed
    • Consecutive non-None seps delimit empty strings
    • maxsplit=-1 means no limit; otherwise returns no more than maxsplit+1 elements
  • splitlines(keepends=False): For str, splits on \r, \n, \r\n, \v, \f, \x1c, \x1d, \x1e (file/group/record separator), \x85 (next line C1), \u2028 (line sep), \u2029 (para sep); for bytes, splits only on \r, \n, \r\n
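
The splitting rules above can be checked with a sketch along these lines (inputs chosen arbitrarily):

```python
explicit_sep = "a,b,,c".split(",")             # consecutive seps give ''
whitespace = "  a  b   c ".split()             # runs collapse, ends trimmed
limited = "a b c d".split(maxsplit=2)          # at most 3 elements
from_right = "a b c d".rsplit(maxsplit=1)

part = "key=value=x".partition("=")            # splits at first sep
rpart = "key=value=x".rpartition("=")          # splits at last sep
part_missing = "novalue".partition("=")
rpart_missing = "novalue".rpartition("=")

lines = "one\ntwo\r\nthree\n".splitlines()     # no trailing empty element
kept = "one\ntwo".splitlines(keepends=True)

print(explicit_sep, whitespace, limited, from_right)
print(part, rpart, part_missing, rpart_missing, lines, kept)
```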

Other:

  • join(iterable): Concatenation of iterable separated by string providing this method.
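  • maketrans(x, y=None, z=None): Static method (str.maketrans) making a translation table; bytes.maketrans(from, to) takes exactly two equal-length arguments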
    • 1 arg: dict mapping ints of Unicode code points or chars to Unicode code points, chars, strings or None
    • 2 args: strings of equal length
    • 3 args: as 2, but 3rd arg is chars to delete
  • translate(table): Chars translated through maketrans table
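
A sketch of the three str.maketrans call forms, plus the bytes variant, which passes the characters to delete to translate() rather than to maketrans(). Table names are made up for the example:

```python
# 1 arg: dict keyed by chars or code points; None deletes
table1 = str.maketrans({"a": "4", ord("e"): "3", "x": None})
one_arg = "axe each".translate(table1)

# 2 args: equal-length strings mapped position by position
table2 = str.maketrans("abc", "xyz")
two_args = "aabbcc".translate(table2)

# 3 args: as 2, with a third string of chars to delete
table3 = str.maketrans("abc", "xyz", "-")
three_args = "a-b-c".translate(table3)

# bytes: two-argument maketrans, deletion handled by translate()
btable = bytes.maketrans(b"abc", b"xyz")
bytes_result = b"a-b-c".translate(btable, b"-")

print(one_arg, two_args, three_args, bytes_result)
```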

Formatting

  • f'...', F'...': (≥3.6) Formatted string literals or f-strings
  • format(*args, **kwargs): See format string syntax
  • format_map(mapping): mapping is used directly and not copied to a dict (useful for dict subclasses)
  • s % values: Not recommended. See printf-string and printf-bytes for more info.
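
The four formatting styles side by side, as a sketch. The Default class is a hypothetical dict subclass written for this example to show why format_map not copying its mapping matters:

```python
name, price = "widget", 3.14159

fstring = f"{name}: {price:.2f}"
formatted = "{0}: {1:>8.2f}".format(name, price)   # right-aligned, width 8


class Default(dict):
    """Hypothetical dict subclass supplying a placeholder for missing keys."""

    def __missing__(self, key):
        return "<" + key + ">"


# format_map uses the mapping directly, so __missing__ is honored
mapped = "{name} costs {price}".format_map(Default(name="widget"))

printf_str = "%s: %.2f" % (name, price)
printf_bytes = b"%s=%d" % (b"n", 5)                # bytes formatting, ≥3.5

print(fstring, formatted, mapped, printf_str, printf_bytes)
```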

I/O