Since Python 3, the str type uses Unicode representation. Unicode strings can take up to 4 bytes per character depending on the encoding, which can be expensive from a memory perspective.
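To make the "up to 4 bytes per character" claim concrete, here is a small sketch that encodes a few sample characters and counts the resulting bytes. The helper name `encoded_sizes` and the choice of sample characters are illustrative assumptions, not part of the original text.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes `ch` occupies in three common encodings.

    `encoded_sizes` is a hypothetical helper written for this illustration.
    The "-le" codec variants are used so that no byte-order mark is counted.
    """
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# Sample characters from increasingly high code point ranges:
# ASCII, Latin-1, the rest of the Basic Multilingual Plane, and beyond it.
for ch in ("a", "é", "€", "🐍"):
    print(repr(ch), encoded_sizes(ch))
```

UTF-32 always spends 4 bytes per character, while UTF-8 ranges from 1 byte for ASCII up to 4 bytes for characters outside the Basic Multilingual Plane.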
To reduce memory consumption and improve performance, CPython (since version 3.3) uses three kinds of internal representations for Unicode strings, choosing the narrowest one that can hold every character in the string:
- 1 byte per char (Latin-1 encoding)
- 2 bytes per char (UCS-2 encoding)
- 4 bytes per char (UCS-4 encoding)
When programming in Python, all strings behave the same, and most of the time we don’t notice any difference. However, the difference can be significant, and sometimes unexpected, when working with large amounts of text.
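As a rough sketch of how the three widths show up in practice, the per-character storage can be estimated by comparing the sizes of two strings of different lengths built from the same character, so that the fixed object overhead cancels out. The helper `bytes_per_char` is a hypothetical name introduced here, and the result assumes CPython 3.3 or later; other implementations may store strings differently.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate how many bytes each character costs in a string made only of `ch`.

    Hypothetical helper for illustration; assumes CPython 3.3+.
    Subtracting the sizes of two strings of the same kind removes
    the fixed per-object header from the measurement.
    """
    longer = sys.getsizeof(ch * (2 * n))
    shorter = sys.getsizeof(ch * n)
    return (longer - shorter) // n


for ch in ("a", "é", "€", "🐍"):
    print(repr(ch), bytes_per_char(ch))
```

On CPython 3.3 or later this should report 1 byte for 'a' and 'é' (both fit in Latin-1), 2 bytes for '€', and 4 bytes for '🐍'. A single wide character is enough to move the whole string to the wider representation.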
To see the difference in internal representations, we can use the sys.getsizeof function, which returns the size of an object in bytes:
>>> import sys
>>> string = 'hello'
>>> sys.getsizeof(string)
54
>>> # 1-byte encoding
>>>