Introduction to Python Programming. Section 4. Working with Text Strings

4 Working with Text Strings

4.1 Objectives

In this section you will learn how to:

  • Define text strings and display them with the print function.
  • Use the repr function to identify and remove trailing spaces.
  • Measure the length of text strings.
  • Use single and double quotes, and the newline character \n.
  • Concatenate and repeat text strings.
  • Access characters by their indices.
  • Parse text strings one character at a time.
  • Slice, copy, and reverse text strings.
  • Check for substrings and count their occurrences.
  • Search for and replace substrings.
  • Decompose a text string into a list of words.

4.2 Why learn about text strings?

By a text string, or just string, we mean a sequence of characters (text). More than 80% of the work computers do is processing text. Therefore, becoming fluent in working with text strings is an absolute must for every programmer. Text strings can be used to make outputs of computer programs more informative, but their primary application is to represent data in various databases:

  • registry of drivers in the DMV,
  • database of customers in a company,
  • database of goods in a warehouse,
  • tweets stored on Twitter,
  • blog posts,
  • phone books,
  • dictionaries, etc.

Before doing any higher-level programming, one needs to understand text strings very well, and be aware of various things that can be done with them.

4.3 Defining text strings

When defining a text string, it does not matter if the sequence of characters is enclosed in single or double quotes:

  'Talk is cheap. Show me the code. (Linus Torvalds)'
  "Life is not fair. Get used to it. (Bill Gates)"

The enclosing quotes are not part of the text string. Sometimes one might want to make them part of a text string, such as when including direct speech. This can be a bit tricky - we are going to talk about it in Subsection 4.14.

Most of the time, text strings in computer programs are shorter, and sometimes they can even be empty. All these are admissible text strings:

  ''
  ' '
  '  '

Here only the first text string is empty (=contains no characters). The other two contain empty space characters ’ ’ which makes them non-empty. They are all different text strings for Python.

4.4 Storing text strings in variables

Just typing a raw text string as above would not be very useful - text strings need to be stored, processed and/or displayed. To store a text string in a variable named (for example) mytext, type

  mytext = 'I want to make a ding in the universe. (Steve Jobs)'

When a variable stores a text string, we call it a text string variable.

4.5 Measuring the length of text strings

Python has a built-in function len to measure the length of text strings. It works on raw text strings,

  print(len('Talk is cheap. Show me the code.'))

  32

as well as on text string variables:

  txt = 'Talk is cheap. Show me the code.'
  print(len(txt))

  32

4.6 Python is case-sensitive

We will talk about variables in much more detail in Section 5, but for now let’s mention that mytext, Mytext, MyText and MYTEXT are four different variables. Python is case-sensitive, meaning that it distinguishes between lowercase and uppercase characters. Also, ’Hello!’ and ’hello!’ are different text strings in Python.

4.7 The print function

Python can display raw text strings, text string variables, numbers and other things using the function print. Here is a simple example displaying just one text string:

  print('I am a text string.')

  I am a text string.

Notice that the enclosing quotes are not displayed. The next example displays a text string and a text string variable:

  name = 'Jennifer'
  print('Her name was', name)

  Her name was Jennifer

Notice that the displayed text strings are separated with one empty space ’ ’. And here is one more example which displays a text string and a number:

  print('The final answer:', 42)

  The final answer: 42

Here the print function automatically converted the number 42 into a text string ’42’. Then an empty space was inserted between the two displayed text strings again.

Separating items with one empty space is the default behavior of the print function. In a moment we are going to show you how to change it, but first let’s introduce a super-useful function help.

4.8 Function help

Python has a built-in function help which can be used to obtain more information about other built-in functions, operators, classes, and other things. Let’s use it to learn more about the function print:

  help(print)

Output:

  Help on built-in function print in module builtins:
  
  print(...)
      print(value, ..., sep=' ', end='\n', file=sys.stdout)
  
      Prints the values to a stream, or to sys.stdout by default.
      Optional keyword arguments:
      file: a file-like object (stream); defaults to the current
      sys.stdout.
      sep:  string inserted between values, default a space.
      end:  string appended after the last value, default a
      newline.

This explains a number of things! The default separator of items sep is one empty space ’ ’. Further, the print function adds by default the newline character \n when it’s done with displaying all items. We will talk more about the newline character in Subsection 4.17. Last, file=sys.stdout means that by default, items are displayed using standard output. One can change the destination to a file if desired. But this is not needed very often as there are other standard ways to write data to files. We will talk about files and file operations in Section 13.

4.9 Changing the default behavior of the print function

Being aware of the optional parameters in the print function can be useful sometimes. First let’s try to improve the previous example by adding a period to properly end the sentence:

  name = 'Jennifer'
  print('Her name was', name, '.')

  Her name was Jennifer .

Uh-oh, that does not look very good. So let’s change the separator sep to an empty text string and retry:

  name = 'Jennifer'
  print('Her name was ', name, '.', sep='')

  Her name was Jennifer.

That looks much better! Notice that we altered the string ’Her name was’ to ’Her name was ’ in order to compensate for the default ’ ’ which was eliminated.

Changing the default value of end might be useful when we want to display items on the same line. Namely, by default, each call to the print function produces a new line:

  prime1 = 2
  prime2 = 3
  prime3 = 5
  print(prime1)
  print(prime2)
  print(prime3)

  2
  3
  5

So let’s change end to an empty space ’ ’:

  prime1 = 2
  prime2 = 3
  prime3 = 5
  print(prime1, end=' ')
  print(prime2, end=' ')
  print(prime3)

  2 3 5

Well, this example is a bit artificial because one could just type

  print(prime1, prime2, prime3)

But imagine that we are calculating and displaying the first 1000 primes, one at a time, using a for loop. (The for loop will be introduced in Subsection 4.22.)
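
As a rough sketch of that idea, the loop below prints a handful of numbers on one line by setting end to a single space. The list primes and the in-memory stream buffer (from the standard io module, standing in for a file) are assumptions made for this illustration; they do not appear in the examples above.

```python
import io

primes = [2, 3, 5, 7, 11]      # hypothetical sample data

buffer = io.StringIO()         # in-memory text stream, used here instead of a file
for p in primes:
    # end=' ' replaces the default newline, file=buffer redirects the output
    print(p, end=' ', file=buffer)
print(file=buffer)             # a print call without items just ends the line

line = buffer.getvalue()
print(repr(line))
```

Leaving out file=buffer sends the same text to standard output instead.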

4.10 Undesired trailing spaces and function repr

By trailing spaces we mean empty spaces which are present at the end of a text string. They can get there via user input, when decomposing a large text into sentences or words, and in other ways. Most of the time they are "dirt", and they can be the source of nasty bugs which are difficult to find.

For illustration, let’s print the value of a variable named city:

  print(city)

  Portland

Great! Then the text string stored in the variable city must be ’Portland’, right? Hmm - nope. Let’s print the value of this variable using the function repr which reveals empty spaces and special characters:

  print(repr(city))

  'Portland  '

Now if our program was checking the variable city against a database of cities which contains ’Portland’, it would not be found because ’Portland ’ and ’Portland’ are different text strings. That’s what we mean by saying that trailing spaces are unwanted. The function repr is very useful because it helps us to find them. In the next subsection we will show you how to remove them.
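
The following minimal sketch reproduces the situation. The value assigned to city is an assumption chosen for illustration (two trailing spaces).

```python
city = 'Portland  '            # assumed value with two trailing spaces

print(city)                    # looks like Portland on the screen
print(repr(city))              # the enclosing quotes reveal the trailing spaces
print(len(city))               # the length counts the spaces as well

is_match = (city == 'Portland')
print(is_match)                # the comparison fails because the strings differ
```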

4.11 Cleaning text strings with strip, lstrip, and rstrip

Python can remove trailing spaces and other undesired characters both from the right end and left end of a text string using text string methods strip, lstrip, and rstrip. Usually, in addition to empty spaces ’ ’, undesired are also newline characters \n and tab characters \t. The string method strip will remove all of them both from the right end and the left end of the text string:

  state = '  \t Alaska  \n  '
  print(repr(state))
  state = state.strip()
  print(repr(state))

  '  \t Alaska  \n  '
  'Alaska'

Note that calling just state.strip() would not work. The text string state would not be changed because the method strip returns a new text string. The line state = state.strip() actually does two things:

  1. Calling state.strip() creates a copy of the text string state, removes the undesired characters, and returns it.
  2. The new cleaned text string ’Alaska’ is then copied back into the variable state, which overwrites the original text ’ \t Alaska \n ’ that was there before.

The method rstrip does the same thing as strip, but only at the right end of the text string:

  state = '  \t Alaska  \n  '
  print(repr(state))
  state = state.rstrip()
  print(repr(state))

  '  \t Alaska  \n  '
  '  \t Alaska'

And last, the method lstrip does the same thing as strip, but only at the left end of the text string:

  state = '  \t Alaska  \n  '
  print(repr(state))
  state = state.lstrip()
  print(repr(state))

  '  \t Alaska  \n  '
  'Alaska  \n  '
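
The three methods also accept an optional argument listing the characters to remove, which is how characters other than whitespace can be stripped. The sample text raw below is hypothetical.

```python
raw = '***Alaska!!'            # hypothetical sample text

both = raw.strip('*!')         # remove '*' and '!' from both ends
left = raw.lstrip('*')         # remove '*' from the left end only
right = raw.rstrip('!')        # remove '!' from the right end only

print(both)
print(left)
print(right)
```

The argument is treated as a set of characters, not as a substring, so their order does not matter.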

4.12 Wait - what is a "method" exactly?

We owe you an explanation of the word "method" which we used in the previous subsection. In Python, almost everything is an object, including text strings. We will discuss object-oriented programming in great detail in Sections 14 and 15. For now let’s just say that every object has attributes (data) and methods (functions that work with the data). So, every text string that you create automatically has the methods strip, rstrip and lstrip (and many more). They are called by appending a period ’.’ to the name of the text string variable, and typing the name of the method. For example, this is how one calls the method strip of the text string state:

state.strip()

4.13 Calling text string methods on raw text strings

As a matter of fact, text string methods can also be called on raw text strings. This is not used very frequently, but it’s good to know about it:

  state = '  \t Alaska  \n  '.strip()
  print(repr(state))

  'Alaska'

4.14 Using single and double quotes in text strings

Since text strings are enclosed in quotes, it can be expected that there will be some limitations on using quotes inside text strings. Indeed, try this:

  txt = "I said: "Hello!""
  print(txt)

Already the syntax highlighting tells you that something is wrong - the first part "I said: " is understood as a text string, but the next Hello! is not a text string because it is outside the quotes. Then "" is understood as an empty text string. The error message confirms that Python does not like this:

  on line 1:
      "I said: "Hello!""
                ^
  SyntaxError: Something is missing between "I said: " and Hello.

Python probably expected something like "I said: " + Hello ... which would make sense if Hello was a text string variable, or "I said: " * Hello ... which would make sense if Hello was an integer. We will talk about adding strings and multiplying them with integers in Subsections 4.20 and 4.21, respectively.

The above example can be fixed by replacing the outer double quotes with single quotes, which removes the ambiguity:

  txt = 'I said: "Hello!"'
  print(txt)

  I said: "Hello!"

The other way would work too - replacing the inner double quotes with single quotes removes the ambiguity as well:

  txt = "I said: 'Hello!'"
  print(txt)

  I said: 'Hello!'

However, if we needed to combine single and double quotes in the same text string,

  I said: "It's OK!"

this simple solution would not work. Let’s show one approach that works always.

4.15 A more robust approach - using characters \' and \"

If we want to include single or double quotes in a text string, the safest way is to use them with a backslash:

  txt = "I said: \"It\'s OK!\""
  print(txt)

  I said: "It's OK!"
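
As a small check, the sketch below writes the same text once with double quotes and once with single quotes on the outside, escaping only the quote character that would otherwise end the text string. The variable names a and b are chosen for illustration.

```python
a = "I said: \"It's OK!\""     # double quotes outside, inner double quotes escaped
b = 'I said: "It\'s OK!"'      # single quotes outside, inner single quote escaped

print(a)
print(b)
print(a == b)                  # both lines define the same text string
```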

4.16 Length of text strings containing special characters

It’s useful to know that the special characters \' and \", although they consist of two regular characters, have length 1:

  print(len('\"'))
  print(len('\''))

  1
  1

And one more example:

  txt = "\"It\'s OK!\""
  print(txt)
  print(len(txt))

  "It's OK!"
  10

4.17 Newline character \n

The newline character \n can be inserted anywhere in a text string. It will cause the print function to end the line at that point and continue with a new line:

  txt = "Line 1\nLine 2\nLine 3"
  print(txt)

  Line 1
  Line 2
  Line 3

The length of the newline character \n is 1:

  print(len('\n'))

  1

4.18 Writing long code lines over multiple lines with the backslash \

Here is a paragraph from "The Raven" by Edgar Allan Poe:

Once upon a midnight dreary, while I pondered weak and weary,
Over many a quaint and curious volume of forgotten lore,
While I nodded, nearly napping, suddenly there came a tapping,
As of some one gently rapping, rapping at my chamber door.
’Tis some visitor,’ I muttered, ’tapping at my chamber door -
Only this, and nothing more.’

Let’s say that we want to store this text in a text string variable named edgar. You already know how to use the newline character \n, so it should not be a problem to create the text string using one very long line of code.

However, how can this be done when Python’s PEP 8 style guide (https://www.python.org/dev/peps/pep-0008/) recommends limiting all lines of code to a maximum of 79 characters? The answer is that one can spread a long line of code over two or more lines using the backslash \:

  edgar = "Once upon a midnight dreary, while I pondered weak \
  and weary,\nOver many a quaint and curious volume of forgotten \
  lore,\nWhile I nodded, nearly napping, suddenly there came a \
  tapping,\nAs of some one gently rapping, rapping at my chamber \
  door.\n'Tis some visitor,' I muttered, 'tapping at my chamber \
  door -\nOnly this, and nothing more.'"
  print(edgar)

Here is how the text string looks when displayed:

  Once upon a midnight dreary, while I pondered weak and weary,
  Over many a quaint and curious volume of forgotten lore,
  While I nodded, nearly napping, suddenly there came a tapping,
  As of some one gently rapping, rapping at my chamber door.
  'Tis some visitor,' I muttered, 'tapping at my chamber door -
  Only this, and nothing more.'
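
For comparison, here is a sketch of a different technique that is not used in this section: Python joins adjacent string literals automatically, so a long text string can also be written as several short literals inside parentheses, without any backslashes at the ends of lines. The variable name edgar_start is hypothetical, and only the first two lines of the poem are used.

```python
# Adjacent string literals inside parentheses are joined into one text string.
edgar_start = (
    "Once upon a midnight dreary, while I pondered weak and weary,\n"
    "Over many a quaint and curious volume of forgotten lore,"
)
print(edgar_start)
```

With this approach each line of code stays short, and leading spaces used to indent the code do not leak into the text string.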

4.19 Multiline strings enclosed in triple quotes

Creating multiline text strings with the newline character \n certainly works for most applications. But sometimes, especially when the text string is very long, it can be cumbersome. Therefore, Python makes it possible to create multiline text strings enclosed in triple quotes. Then one does not have to worry about any special characters at all. Here is the previous code, transformed in this way:

  edgar = """\
  Once upon a midnight dreary, while I pondered weak and weary,
  Over many a quaint and curious volume of forgotten lore,
  While I nodded, nearly napping, suddenly there came a tapping,
  As of some one gently rapping, rapping at my chamber door.
  'Tis some visitor,' I muttered, 'tapping at my chamber door -
  Only this, and nothing more.'\
  """
  print(edgar)

  Once upon a midnight dreary, while I pondered weak and weary,
  Over many a quaint and curious volume of forgotten lore,
  While I nodded, nearly napping, suddenly there came a tapping,
  As of some one gently rapping, rapping at my chamber door.
  'Tis some visitor,' I muttered, 'tapping at my chamber door -
  Only this, and nothing more.'

In case you wonder about the backslashes - they are there to prevent the text string from beginning and ending with \n (we indeed started and ended the text string by making a new line).

Here is one more example. The code

  txt = """
  This is the first line,
  and this is the second line.
  """
  print(txt)

will display one empty line at the beginning and one at the end:


  This is the first line,
  and this is the second line.


Finally, let’s show that creating a multiline text string using the newline character \n and using the triple quote approach are fully equivalent:

  txt1 = "Line 1\nLine 2\nLine 3"
  txt2 = """Line 1
  Line 2
  Line 3"""
  print(repr(txt1))
  print(repr(txt2))

  'Line 1\nLine 2\nLine 3'
  'Line 1\nLine 2\nLine 3'

4.20 Adding text strings

Text strings can be concatenated (added together) using the same ’+’ operator as numbers:

  word1 = 'Edgar'
  word2 = 'Allan'
  word3 = 'Poe'
  name = word1 + word2 + word3
  print(name)

EdgarAllanPoe

Oops! Of course we need to add empty spaces if we want them to be part of the resulting text string:

  word1 = 'Edgar'
  word2 = 'Allan'
  word3 = 'Poe'
  name = word1 + ' ' + word2 + ' ' + word3
  print(name)

Edgar Allan Poe

4.21 Multiplying text strings with integers

Text strings can be repeated by multiplying them with positive integers. For example:

  txt = 'Help me!'
  print('I yelled' + 3 * txt)

I yelledHelp me!Help me!Help me!

Again, empty spaces matter! So let’s try again:

  txt = ' Help me!'
  print('I yelled' + 3 * txt)

I yelled Help me! Help me! Help me!
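
Repetition is handy for simple text layout. The sketch below underlines a heading with a row of dashes of matching length; the names title, line and empty are chosen for illustration. It also shows that multiplying by zero yields an empty text string.

```python
title = 'Report'               # hypothetical heading
line = '-' * len(title)        # as many dashes as the heading has characters

print(title)
print(line)

empty = 'abc' * 0              # multiplying by zero gives an empty text string
print(repr(empty))
```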

4.22 Parsing text strings with the for loop

Python has a keyword for which can be used to form a for loop, and parse a text string one character at a time:

  word = 'breakfast'
  for c in word:
      print(c, end=' ')

b r e a k f a s t

The mandatory parts of the loop are the keywords for and in, and the colon : at the end of the line. The name of the variable is up to you. For example, with letter instead of c the above code would look like this:

  word = 'breakfast'
  for letter in word:
      print(letter, end=' ')

b r e a k f a s t

The action or sequence of actions to be repeated for each character is called the body of the loop. In this case it is a single line

  print(letter, end=' ')

but often the loop’s body is several lines long.

Importantly, note that the body of the loop is indented. The Python style guide https://www.python.org/dev/peps/pep-0008/ recommends using 4 spaces per indentation level. The for loop will be discussed in more detail in Section 9.

4.23 Reversing text strings with the for loop

The for loop can be used to reverse text strings. The following program shows how to do it, and moreover for clarity it displays the result-in-progress text string new after each cycle of the loop:

  orig = 'breakfast'
  new = ''
  for c in orig:
      new = c + new
      print(new)

b
rb
erb
aerb
kaerb
fkaerb
afkaerb
safkaerb
tsafkaerb

Note that this code created a new text string and the original text string stayed unchanged. The reason is that text strings cannot be changed in place – they are immutable objects. We will talk in more detail about mutable and immutable types in Subsection 7.33. And last, in Subsection 4.32 we will show you a quicker way to reverse text strings based on slicing.
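
The sketch below illustrates both claims: the original text string is unchanged after the loop, and an attempt to overwrite one of its characters fails. The try/except construct and the helper variable changed are assumptions used only to keep the example running after the error; they are not introduced in this section.

```python
orig = 'breakfast'
new = ''
for c in orig:
    new = c + new              # builds a new text string in every cycle

try:
    orig[0] = 'B'              # text strings do not support item assignment
    changed = True
except TypeError:
    changed = False

print(orig)                    # the original text string is intact
print(new)
print(changed)
```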

4.24 Accessing individual characters via their indices

All characters in the text string are enumerated. The first character has index 0, the second one has index 1, the third one has index 2, etc. The character with index n can be obtained when appending [n] at the end of the text string:

  word = 'fast'
  print('First character:', word[0])
  print('Second character:', word[1])
  print('Third character:', word[2])
  print('Fourth character:', word[3])

First character: f
Second character: a
Third character: s
Fourth character: t

Make sure that you remember:

Indices in Python start from zero.

4.25 Using negative indices

Sometimes it can be handy to use the index -1 for the last character, -2 for the one-before-last etc:

  word = 'fast'
  print('Fourth character:', word[-1])
  print('Third character:', word[-2])
  print('Second character:', word[-3])
  print('First character:', word[-4])

Fourth character: t
Third character: s
Second character: a
First character: f
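
A negative index -k refers to the same character as the index len(word) - k. The sketch below checks this, and also shows that an index past the end of the text string raises an IndexError. The try/except construct and the names same and in_range are assumptions made for illustration.

```python
word = 'fast'
n = len(word)

same = (word[-1] == word[n - 1])   # -1 and n - 1 address the last character
print(same)

try:
    word[n]                        # valid indices are 0 .. n - 1 and -n .. -1
    in_range = True
except IndexError:
    in_range = False
print(in_range)
```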

4.26 ASCII table

Every standard text character has its own ASCII code which is used to represent it in the computer memory. ASCII stands for American Standard Code for Information Interchange. The table dates back to the 1960s, and the original 7-bit standard defines 128 codes, which are summarized in Fig. 50. Later 8-bit extensions added codes 128 - 255, for 256 codes in total.


Fig. 50: ASCII table, codes 0 - 127.


The first 32 codes 0 - 31 represent various non-printable characters, of which some are hardly used today. But some are still used, such as code 8 which means backspace \b, code 9 which means horizontal tab \t, and code 10 which means new line \n. The upper table (codes 128 - 255) is not part of the original ASCII standard. It represents various special characters whose meaning depends on the encoding used, and which you can easily find online if you need them.

4.27 Finding the ASCII code of a given character

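Python has a built-in function ord which can be used to access ASCII codes of text characters. (Strictly speaking, ord returns the Unicode code point of the character, which coincides with the ASCII code for all characters in the ASCII table.)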

  print(ord('a'))

97

The text character representing the first digit ’0’ has ASCII code 48:

  print(ord('0'))

48

Since you already know from Subsection 4.22 how to parse text strings with the for loop, it is a simple exercise to convert a text string into a sequence of ASCII codes automatically:

  txt = 'breakfast'
  for c in txt:
      print(ord(c), end=' ')

98 114 101 97 107 102 97 115 116

4.28 Finding the character for a given ASCII code

This is the inverse task to what we did in the previous subsection. Python has a function chr which accepts an ASCII code and returns the corresponding character:

  print(chr(97))

a

It is a simple exercise to convert a list of ASCII codes into a text string:

  L = [98, 114, 101, 97, 107, 102, 97, 115, 116]
  txt = ''
  for n in L:
      txt += chr(n)
  print(txt)

breakfast

Lists will be discussed in more detail in Section 7.
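
The functions ord and chr can be combined. In the ASCII table, each lowercase letter a - z has a code which is greater by 32 than the code of the corresponding uppercase letter. The sketch below uses this to uppercase a word one character at a time. It is an illustration only and assumes that the word contains nothing but the letters a - z.

```python
lower = 'a'
upper = chr(ord(lower) - 32)   # code 97 becomes code 65
print(upper)

word = 'fast'                  # assumed to contain only the letters a - z
shouted = ''
for c in word:
    shouted += chr(ord(c) - 32)
print(shouted)
```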

4.29 Slicing text strings

Python makes it easy to extract substrings of text strings using indices – this is called slicing:

  w1 = 'bicycle'
  w2 = w1[1:4]
  print(w2)

icy

Note that the second index (in this case 4) minus the first one (in this case 1) yields the length of the resulting text string. This also means that the character with the second index (in this case w1[4] which is ’c’) is not part of the resulting slice.

Omitting the first index in the slice defaults to zero:

  w3 = w1[:2]
  print(w3)

bi

Omitting the second index defaults to the length of the string:

  w4 = w1[2:]
  print(w4)

cycle

4.30 Third index in the slice

The slice notation allows for a third index which stands for stride. By default it is 1. Setting it to 2 will extract every other character:

  orig = 'circumstances'
  new1 = orig[::2]
  print(new1)

crusacs

Here we omitted the first and second indices in the slice, which means that the entire text string was used.
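
The stride can be combined with the first and second index. The sketch below shows a few such combinations on the same word; the variable names are chosen for illustration.

```python
orig = 'circumstances'

odd = orig[1::2]               # every other character, starting at index 1
third = orig[::3]              # every third character of the whole string
part = orig[2:9:3]             # indices 2, 5 and 8

print(odd)
print(third)
print(part)
```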

4.31 Creating copies of text strings

The easiest way to create a new copy of a text string is as follows:

  orig = 'circumstances'
  new2 = orig
  print(new2)

circumstances

Equivalently, one can use slicing. In the case of text strings these two approaches are equivalent because they are immutable objects:

  orig = 'circumstances'
  new3 = orig[:]
  print(new3)

circumstances

Mutability and immutability of objects in Python will be discussed in more detail in Subsection 7.33.

4.32 Reversing text strings using slicing

When the stride is set to -1, the text string will be reversed. This is the fastest and easiest way to reverse text strings:

  orig = 'circumstances'
  new4 = orig[::-1]
  print(new4)

secnatsmucric

Note that typing just orig[::-1] will not change the text string orig because text strings are immutable.

4.33 Retrieving current date and time

Extracting system date and time, and transforming it to various other formats is a nice way to practice slicing. To begin with, let’s import the time library and call the function ctime:

  import time
  txt = time.ctime()
  print(txt)
  print(len(txt))

Fri May 11 18:23:03 2018
24

The function ctime returns a 24-character text string. Here is its structure:

  print(txt[:3])    # day of the week (three characters)
  print(txt[4:7])   # month (three characters)
  print(txt[8:10])  # day of the month (two characters)
  print(txt[11:19]) # hh:mm:ss (eight characters)
  print(txt[20:])   # year (four characters)

Fri
May
11
18:23:03
2018

For example, here is code that displays the date in a different format, May 11, 2018:

  import time
  txt = time.ctime()
  newdate = txt[4:7] + ' ' + txt[8:10] + ', ' + txt[20:]
  print(newdate)

May 11, 2018
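
One detail worth knowing: when the day of the month has a single digit, ctime pads it with a space, so the slice [8:10] then begins with a space. The sketch below uses a fixed sample string in the ctime layout (an assumption, so that the result does not depend on the current time) and removes the padding with strip.

```python
sample = 'Wed Jan  3 09:05:07 2024'   # fixed sample in the ctime layout

month = sample[4:7]
day = sample[8:10].strip()            # ' 3' becomes '3'
year = sample[20:]
hours = sample[11:13]
minutes = sample[14:16]

newdate = month + ' ' + day + ', ' + year
short_time = hours + ':' + minutes
print(newdate)
print(short_time)
```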

4.34 Making text strings lowercase

The method lower returns a copy of the text string where all characters are converted to lowercase. It has an important application in search where it is used to make the search case-insensitive. We will talk about this in more detail in Subsection 4.36. Meanwhile, the following example illustrates how the method lower works:

  txt = 'She lives in New Orleans.'
  new = txt.lower()
  print(new)

she lives in new orleans.

4.35 Checking for substrings in a text string

Python has a keyword in which you already know from Subsection 4.22. There it was used as part of the for loop to parse text strings. The same keyword can be used to check for occurrences of substrings in text strings. The expression

substr in txt
returns True if substring substr is present in text string txt, and False otherwise. Here True and False are Boolean values which will be discussed in more detail in Section 6. In the meantime, here is an example that illustrates the search for substrings:

  txt = 'Adam, Michelle, Anita, Zoe, David, Peter'
  name = 'Zoe'
  result = name in txt
  print(result)

True

Usually, the search for a substring is combined with an if-else statement. Conditions will be discussed in more detail in Section 10, but let’s show a simple example:

  txt = 'Adam, Michelle, Anita, Zoe, David, Peter'
  name = 'Hunter'
  if name in txt:
      print('The name ' + name + ' is in the text.')
  else:
      print('The name ' + name + ' was not found.')

The name Hunter was not found.

Here, the expression name in txt returned False. Therefore the condition was not satisfied, and the else branch was executed.

4.36 Making the search case-insensitive

As you know, Python is case-sensitive. However, when searching in text strings, one often prefers to make the search case-insensitive to avoid missing some occurrences. For example, searching the following text string txt for the word ’planet’ fails for this reason:

  txt = 'Planet Earth is part of the solar system.'
  substr = 'planet'
  result = substr in txt
  print(result)

False

The solution is to lowercase both text strings, which makes the search case-insensitive:

  txt = 'Planet Earth is part of the solar system.'
  substr = 'planet'
  result = substr.lower() in txt.lower()
  print(result)

True
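
If the case-insensitive check is needed in several places, it can be wrapped in a small helper function. The function name contains_ignore_case is hypothetical, and functions themselves are not covered in this section, so take this as a sketch only.

```python
def contains_ignore_case(txt, substr):
    """Return True if substr occurs in txt, ignoring the case of characters."""
    return substr.lower() in txt.lower()

txt = 'Planet Earth is part of the solar system.'
found = contains_ignore_case(txt, 'PLANET')
not_found = contains_ignore_case(txt, 'galaxy')
print(found)
print(not_found)
```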

4.37 Making text strings uppercase

The method upper works analogously to lower except it returns an uppercased copy of the text string:

  txt = "don't yell at me like that."
  new = txt.upper()
  print(new)

  DON'T YELL AT ME LIKE THAT.

4.38 Finding and replacing substrings

Method replace(old, new[, count]) returns a copy of the original text string where all occurrences of text string old are replaced by text string new:

  txt = 'First day, second day, and third day.'
  substr1 = 'day'
  substr2 = 'week'
  new = txt.replace(substr1, substr2)
  print(new)

First week, second week, and third week.

If the optional argument count is given, only the first count occurrences are replaced:

  txt = 'First day, second day, and third day.'
  substr1 = 'day'
  substr2 = 'week'
  new = txt.replace(substr1, substr2, 2)
  print(new)

First week, second week, and third day.

4.39 Counting occurrences of substrings

Method count(sub[, start[, end]]) returns the number of occurrences of substring sub in the slice [start:end] of the original text string. If the optional parameters start and end are left out, the whole text string is used:

  txt = 'John Stevenson, John Lennon, and John Wayne.'
  print(txt.count('John'))

3

What we said in Subsection 4.36 about the disadvantages of case-sensitive search applies here as well. If there is a chance that some occurrences will differ by the case of characters, it is necessary to make the search case-insensitive by lowercasing both text strings:

  txt = 'Planet Earth, planet Venus, planet Uranus, planet Mars.'
  substr = 'planet'
  print(txt.lower().count(substr.lower()))

4

4.40 Locating substrings in text strings

Method find(sub[, start[, end]]) returns the lowest index in the text string where substring sub is found, such that sub is contained in the slice [start:end]. If the optional arguments start and end are omitted, the whole text string is used. The method returns -1 if the text string sub is not found:

  txt = 'They arrived at the end of summer.'
  substr = 'end'
  print(txt.find(substr))

20

Case-insensitive version:

  txt = 'They arrived at the end of summer.'
  substr = 'end'
  print(txt.lower().find(substr.lower()))

20

Python also has a method index(sub[, start[, end]]) which works like find, but raises ValueError instead of returning -1 when the substring is not found.
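
The sketch below shows the optional argument start in action, together with the two ways a failed search is reported. The try/except construct and the helper variable raised are assumptions used for illustration.

```python
txt = 'First day, second day, and third day.'

first = txt.find('day')               # index of the first occurrence
second = txt.find('day', first + 1)   # search again, starting after the first hit
missing = txt.find('week')            # find reports a failed search with -1
print(first, second, missing)

try:
    txt.index('week')                 # index raises ValueError instead
    raised = False
except ValueError:
    raised = True
print(raised)
```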

4.41 Splitting a text string into a list of words

Any text string can be split into a list of words using the string method split. (Lists will be discussed in more detail in Section 7.) If no extra arguments are given, the words will be separated by arbitrary strings of whitespace characters (space, tab, newline, return, formfeed, ...):

  txt = 'This is, indeed, a basic use of the split method!'
  L = txt.split()
  print(L)

['This', 'is,', 'indeed,', 'a', 'basic', 'use', 'of', 'the',
'split', 'method!']

It is possible to use another separator by passing it to the method as an optional argument:

  txt = 'This is, indeed, a basic use of the split method!'
  L = txt.split(',')
  print(L)

['This is', ' indeed', ' a basic use of the split method!']

4.42 Splitting a text string while removing punctuation

Notice that in the previous subsection, the commas ’,’ and the exclamation mark ’!’ remained in the words ’is,’, ’indeed,’ and ’method!’. If we want to remove them, redefining the delimiter does not help.

In this case it is best to do it manually – first replace any undesired characters with empty space ’ ’ and then use the standard split method without any extra arguments. If we know that we only want to remove ’,’ and ’!’, we can do it one by one as follows:

  txt = 'This is, indeed, a basic use of the split method!'
  txt2 = txt.replace(',', ' ')
  txt2 = txt2.replace('!', ' ')
  L = txt2.split()
  print(L)

['This', 'is', 'indeed', 'a', 'basic', 'use', 'of', 'the',
'split', 'method']

When we deal with a larger text and want to remove all punctuation characters, we can take advantage of the text string punctuation which is present in the string library:

  import string
  txt = 'This is, indeed, a basic use of the split method!'
  txt2 = txt[:]
  for c in string.punctuation:
      txt2 = txt2.replace(c, ' ')
  L = txt2.split()
  print(L)

['This', 'is', 'indeed', 'a', 'basic', 'use', 'of', 'the',
'split', 'method']
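
As an alternative to the loop, the same replacement can be done in one pass with the string methods maketrans and translate. This is a different technique from the one shown above, offered as a sketch; the variable names table and words are chosen for illustration.

```python
import string

txt = 'This is, indeed, a basic use of the split method!'

# Map every punctuation character to an empty space.
table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

words = txt.translate(table).split()
print(words)
```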

4.43 Joining a list of words into a text string

Method join(L) returns a text string which is the concatenation of the strings in the list L. In fact, L can also be a tuple or any sequence. The base string is used as separator:

  txt = '...'
  str1 = 'This'
  str2 = 'is'
  str3 = 'the'
  str4 = 'movie.'
  print(txt.join([str1, str2, str3, str4]))

This...is...the...movie.
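
The methods split and join are often used together. The sketch below splits a text string on whitespace and joins the words with single spaces, which removes repeated and leading or trailing whitespace. The sample text messy is hypothetical.

```python
messy = '  This   is \t the\nmovie.  '   # hypothetical text with irregular whitespace

words = messy.split()                    # split on any whitespace
clean = ' '.join(words)                  # join the words with single spaces
print(clean)

row = ','.join(['a', 'b', 'c'])          # any text string can serve as separator
print(row)
```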

4.44 Method isalnum

This method returns True if all characters in the string are alphanumeric (letters or numbers) and there is at least one character. Otherwise it returns False:

  str1 = 'Hello'
  if str1.isalnum():
      print(str1, 'is alphanumeric.')
  else:
      print(str1, 'is not alphanumeric.')
  str2 = 'Hello!'
  if str2.isalnum():
      print(str2, 'is alphanumeric.')
  else:
      print(str2, 'is not alphanumeric.')

Hello is alphanumeric.
Hello! is not alphanumeric.

4.45 Method isalpha

This method returns True if all characters in the string are alphabetic and there is at least one character, and False otherwise. By 'alphabetic' we mean letters only, not digits and not even spaces:

  str1 = 'John'
  if str1.isalpha():
      print(str1, 'is alphabetic.')
  else:
      print(str1, 'is not alphabetic.')

John is alphabetic.

And here is another example:

  str2 = 'My name is John'
  if str2.isalpha():
      print(str2, 'is alphabetic.')
  else:
      print(str2, 'is not alphabetic.')

My name is John is not alphabetic.

4.46 Method isdigit

This method returns True if all characters in the string are digits and there is at least one character, and False otherwise:

  str1 = '2012'
  if str1.isdigit():
      print(str1, 'is a number.')
  else:
      print(str1, 'is not a number.')
  str2 = 'Year 2012'
  if str2.isdigit():
      print(str2, 'is a number.')
  else:
      print(str2, 'is not a number.')

2012 is a number.
Year 2012 is not a number.
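The methods isalnum, isalpha and isdigit are easy to confuse, so it may help to run them side by side on a few inputs. The sketch below uses sample strings chosen for illustration. Note that isdigit looks only at individual characters, so strings such as '-5' or '3.14' are not reported as digits even though they represent numbers, and the empty string fails all three tests:

```python
samples = ['Hello', 'Hello2', '2012', 'Year 2012', '-5', '3.14', '']

results = {}
for s in samples:
    # Store the three answers as a tuple: (isalnum, isalpha, isdigit).
    results[s] = (s.isalnum(), s.isalpha(), s.isdigit())
    # repr shows the quotes, which makes the empty string visible.
    print(repr(s), results[s])
```

If a program needs to decide whether a text string represents a signed or decimal number, isdigit alone is therefore not sufficient.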

4.47 Method capitalize

This method returns a copy of the text string with only its first character capitalized. All remaining characters will be lowercased:

  txt = 'SENTENCES SHOULD START WITH A CAPITAL LETTER.'
  print(txt.capitalize())

Sentences should start with a capital letter.

4.48 Method title

This method returns a titlecased version of the string where all words begin with uppercase characters, and all remaining cased characters are lowercase:

  txt = 'this is the title of my new book'
  print(txt.title())

This Is The Title Of My New Book
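The difference between capitalize and title is easiest to see on the same input. The sketch below uses a sample sentence chosen for illustration; it also shows that title treats an apostrophe as a word boundary, which may be surprising with contractions and possessives:

```python
txt = "they're reading bill's new book"

# capitalize changes only the first character of the whole string.
capitalized = txt.capitalize()
print(capitalized)

# title starts a new "word" after every non-letter, including apostrophes.
titled = txt.title()
print(titled)
```

The first call yields "They're reading bill's new book", while the second yields "They'Re Reading Bill'S New Book".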

4.49 C-style string formatting

Python supports C-style string formatting to facilitate creating formatted strings. For example, %s can be used to insert a text string:

  name = 'Prague'
  print('The city of %s was the favorite destination.' % name)

The city of Prague was the favorite destination.

Formatting string %d can be used to insert an integer:

  num = 30
  print('At least %d applicants passed the test.' % num)

At least 30 applicants passed the test.

Formatting string %.Nf can be used to insert a real number rounded to N decimal digits:

  import numpy as np
  print('%.3f is Pi rounded to three decimal digits.' % np.pi)

3.142 is Pi rounded to three decimal digits.

Finally, it is possible to insert multiple strings, integers, and real numbers at once by passing them as a tuple:

  n = 'Jim'
  a = 16
  w = 180.67
  print('%s is %d years old and weighs %.0f lbs.' % (n, a, w))

Jim is 16 years old and weighs 181 lbs.
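For comparison, the same sentence can also be built with the format method or with an f-string. This is only a sketch of the equivalent syntax, not something this section relies on, and f-strings assume Python 3.6 or newer. The names old_style, with_format and with_fstring are illustrative:

```python
n = 'Jim'
a = 16
w = 180.67

# C-style formatting with the % operator, as shown above.
old_style = '%s is %d years old and weighs %.0f lbs.' % (n, a, w)

# The format method fills the {} placeholders from left to right.
with_format = '{} is {} years old and weighs {:.0f} lbs.'.format(n, a, w)

# An f-string evaluates the expressions inside the braces directly.
with_fstring = f'{n} is {a} years old and weighs {w:.0f} lbs.'

print(old_style)
print(with_format)
print(with_fstring)
```

All three lines print the same text. The part after the colon, such as .0f, plays the same role as the characters after % in the C-style version.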

4.50 Additional string methods

In this section we were only able to show the most frequently used text string methods. The string class has many more methods which can be found in the official Python documentation at https://docs.python.org/3.3/library/stdtypes.html#string-methods (scroll down to section "String Methods").

4.51 The string library

The string library contains additional useful string functionality including string constants (such as all ASCII characters, all lowercase characters, all uppercase characters, all printable characters, all punctuation characters, ...), string formatting functionality, etc. More information can be found at https://docs.python.org/3/library/string.html.
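A few of these constants can be inspected directly. The sketch below prints some of them and then uses them to check which kinds of characters a text string contains. The sample string password and the names has_digit and has_punct are made up for illustration:

```python
import string

print(string.ascii_lowercase)
print(string.ascii_uppercase)
print(string.digits)
print(string.punctuation)

# The constants are ordinary text strings, so the keyword in works on them.
password = 'Secret42!'
has_digit = any(c in string.digits for c in password)
has_punct = any(c in string.punctuation for c in password)
print(has_digit, has_punct)
```

Because the constants are plain text strings, they can be sliced, searched, and looped over like any other string in this section.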

4.52 Natural language toolkit (NLTK)

The Natural Language Toolkit (NLTK) available at https://www.nltk.org/ is a leading platform for building Python programs to work with human language data. NLTK is a free, open source, community-driven project which is suitable for linguists, engineers, students, educators, researchers, and industry users. NLTK provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.


Created on August 6, 2018 in Python I,   Python II.