Introduction to Python Programming. Section 13. File Operations

13 File Operations

13.1 Objectives

In this section you will learn

  • About the NCLab file system and the security of your data in NCLab.
  • How to open files via the built-in function open and the with statement.
  • Various modes to open files (for reading, writing, appending, etc.).
  • About Latin-1 and UTF-8 encoding of text files.
  • How to parse files one line at a time, and other ways to read data from a file.
  • To consider file size and the amount of available runtime memory.
  • How to use the file pointer and work on the character level.
  • About various ways to write data to a file.
  • That things can go wrong, and one should use exceptions when working with files.

In this section we will mostly work with text files. Binary files will be discussed later in Subsections 13.30 - 13.33.

13.2 Why learn about files?

So far you have learned about using various programming techniques to process data in the form of text strings, variables, lists, dictionaries, etc. But all such data, in one way or another, comes from files. Whether stored on the hard disk of your computer or on a cloud server, files are an integral part of computer programming, and every software developer must be fluent in working with them.

13.3 NCLab file system and security of your data

The file manager My Files whose icon is present on your NCLab Desktop gives you access to your own NCLab file system which works similarly to the Windows, Linux and other file systems. You can create files and folders, rename them, move them, and delete them. You can also upload files from your hard disk, flash drive, or from the web. Thanks to a generous quota you should never be worried about running out of space. On the backend, your files are stored in a Mongo distributed database system at Digital Ocean, and all data is backed up every day. The communication between your browser and the cloud servers is encrypted for maximum security.

13.4 Creating sample text file

Depending on the PDF viewer you are using to read this text, you should be able to select the block below with your mouse, and use CTRL+C to copy it to clipboard:

                    #            #
                    #            #
  ######## ######## #   ######## ########
  #      # #        #          # #      #
  #      # #        #   ######## #      #
  #      # #        #   #      # #      #
  #      # ######## ### ######## ########

Next, go to the NCLab Desktop, open My Files, and create a new folder named "sandbox". We will be using it throughout this section. Then, open the Creative Suite, and under Accessories launch Notepad. Paste the text there via CTRL+V, and you should see something like this:


Fig. 54: Notepad is a simple text editor in NCLab.


Then use "Save in NCLab" in the Notepad’s File menu to save the file under the name "logo" to your newly created folder "sandbox". Notepad will automatically add an extension ".txt" to it (which will not be displayed by the file manager – like in Windows). The text "Untitled Text" in the top part of the Notepad window will change to "logo":


Fig. 55: Your sample text file "logo.txt" is ready to use.


13.5 Creating sample Python file

So far it did not matter where your Python code was located, and it was even OK to run any program in unsaved mode in the Python app. This is different when working with files. The reason is that when opening a file, the interpreter will look for it in the current folder where your Python source code is saved.

Therefore, let’s go to the Creative Suite, launch a new Python worksheet, erase the demo code, and type there a single line

  f = open(’logo.txt’, mode=’r’)

Then use "Save in NCLab" in the File menu to save the file under the name "test" to the folder "sandbox" where you already have your file "logo.txt". The Python app will automatically add an extension ".py" to it which will not be displayed by the file manager. This is what it should look like:


Fig. 56: Sample file "test.py".


13.6 Opening a text file for reading – function open

In the previous subsection you already met the built-in function open which can be used to open files:

  f = open(’logo.txt’, mode=’r’)

The parameter name mode can be left out for brevity, so usually you will see the function open used as follows:

  f = open(’logo.txt’, ’r’)

The first argument ’logo.txt’ is the file name. The interpreter will look for it in the directory where your Python source code is located. Using subfolders such as ’data/logo.txt’ is OK as well (assuming that there is a subfolder ’data’ in your current directory which contains the file ’logo.txt’).

The next argument ’r’ means "for reading in text mode". Various other modes to open a file will be mentioned in Subsection 13.13.

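The function open has an optional parameter encoding which should be passed by keyword, as in open(’logo.txt’, ’r’, encoding=’utf-8’). Its default value depends on the platform; in the examples of this section it is ’utf-8’. We will talk about encoding in Subsection 13.11.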

Finally, the variable f will store a file object which provides access to the file contents. As everything else in Python, it is an instance of a class, which we will call the file class. In the following subsections we will explain various methods of this class which can be used to work with data in files.

13.7 Closing a file – method close

The file class has a method close to close the file. An open file named f can be closed by typing

  f.close()

Nothing happens when one tries to close a file which is not open.

13.8 Checking whether a file is closed

To check whether a file f is closed, one can use the attribute closed of the file class:

  f.closed

which is either True or False.
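The method close and the attribute closed can be tried together in a short sketch. This is only an illustration under our own assumptions: it writes a throwaway file named demo.txt into a temporary folder so that it runs anywhere, and it also shows that using a closed file raises ValueError.

```python
import os
import tempfile

# Assumption: a throwaway file in a temporary folder stands in for "logo.txt".
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')

f = open(path, 'w')
f.write('Hello!\n')
before = f.closed      # False: the file is still open

f.close()
after = f.closed       # True: the file is closed now
f.close()              # closing an already closed file does nothing

# Writing to (or reading from) a closed file raises ValueError.
try:
    f.write('more text')
    write_failed = False
except ValueError:
    write_failed = True

print(before, after, write_failed)
```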

13.9 Using the with statement to open files

The preferred way to open files is using the with statement:

  with open(’logo.txt’, ’r’) as f:

The body of the with statement is indented analogously to loops and conditions. There is no need to use the close method – the file will be closed automatically, even if any exceptions are raised. This is a good alternative to try-finally blocks which were introduced in Subsection 12.8. More details on the usage of the with statement can be found in the PEP 343 section of the official Python documentation at https://www.python.org/dev/peps/pep-0343/.
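To see that the file really is closed even when an exception is raised, one can try a small experiment such as the following. It is a sketch only: the file demo.txt, the temporary folder and the simulated RuntimeError are our assumptions, not part of the "logo.txt" example.

```python
import os
import tempfile

# Assumption: a throwaway file in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
with open(path, 'w') as f:
    f.write('first line\n')

first = None
try:
    with open(path, 'r') as f:
        first = f.readline()
        # Simulate a failure inside the body of the with statement.
        raise RuntimeError('simulated failure')
except RuntimeError:
    pass

# The with statement closed the file before the exception left its body.
closed_after_error = f.closed
print(repr(first), closed_after_error)
```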

13.10 Getting the file name

The file class has an attribute name which can be used to obtain the filename:

  with open(’logo.txt’, ’r’) as f:
      print(f.name)

  logo.txt

13.11 A word about encoding, Latin-1, and UTF-8

In Subsections 4.26 - 4.28 we discussed the ASCII table which translates Latin text characters and other widely used text symbols into decimal integer codes and vice versa. The decimal integer codes are then translated into the corresponding binary numbers (zeros and ones) before they are saved to the disk. For example, the ASCII text string

  Hello!

becomes

  010010000110010101101100011011000110111100100001

Every 8-bit (1-Byte) segment in this sequence represents one ASCII character. For example, 01001000 is ’H’ etc.

The process of converting text into a binary sequence is called encoding. It only applies to text files, not to binary files. The above example, up to some technical details, describes Latin-1 (ISO-8859-1) encoding. This was the standard encoding for Western European languages until the advent of Unicode in 1991 and of its UTF-8 encoding shortly afterwards.

Unicode – which for simplicity can be understood as "extended ASCII" – was created in order to accommodate accents (ñ, ü, ř, ç, ...), Japanese, Chinese and Arabic symbols, and various special characters coming from other languages. While the original 8-bit Latin-1 (ISO-8859-1) can only encode 256 characters, Unicode (UTF-8) can encode 1,114,112. It uses a variable number of bytes (one to four) per character, trying to be as economical as possible. The exact scheme is beyond the scope of this textbook but you can find it on the web.

Importantly, UTF-8 is backward compatible with Latin-1 (ISO-8859-1) in the sense that UTF-8 codes of ASCII characters are the same as their Latin-1 (ISO-8859-1) codes. In other words, it does not matter whether UTF-8 or Latin-1 (ISO-8859-1) encoding is used for ASCII text.

Moreover, UTF-8 encoding is the standard in Python 3, so you don’t have to worry about using language-specific non-ASCII characters in your text strings. We will stop here for now, but feel free to learn more about encoding in the official Python documentation online.
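The encoding of the string Hello! shown above can be reproduced with the string method encode. The sketch below is an illustration only (the variable names are ours). It also checks that Latin-1 and UTF-8 agree for ASCII text, and shows that they differ for the character ñ, which takes one byte in Latin-1 and two bytes in UTF-8.

```python
text = 'Hello!'

# Encode the text and display every byte as 8 binary digits.
data = text.encode('latin-1')
bits = ''.join(format(byte, '08b') for byte in data)
print(bits)

# For ASCII text, Latin-1 and UTF-8 produce identical bytes.
same_for_ascii = text.encode('utf-8') == data

# A non-ASCII character: one byte in Latin-1, two bytes in UTF-8.
n_tilde = '\u00f1'                          # the character ñ
latin1_bytes = n_tilde.encode('latin-1')    # b'\xf1'
utf8_bytes = n_tilde.encode('utf-8')        # b'\xc3\xb1'

# The character ř (U+0159) cannot be encoded in Latin-1 at all.
try:
    '\u0159'.encode('latin-1')
    r_caron_in_latin1 = True
except UnicodeEncodeError:
    r_caron_in_latin1 = False
```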

13.12 Checking text encoding

Class file has an attribute encoding which can be used to obtain information about text encoding:

  with open(’logo.txt’, ’r’) as f:
      print(f.encoding)

  utf-8

The file class has a number of other useful attributes besides name and encoding. Feel free to learn more at https://docs.python.org/3/library/index.html.
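The encoding can also be chosen explicitly through the keyword parameter encoding of the function open. The following sketch is only an illustration (the file name latin1.txt and the sample word are our assumptions). It writes a word with a non-ASCII character in Latin-1 and shows that reading it back with the wrong encoding fails with UnicodeDecodeError.

```python
import os
import tempfile

# Assumption: a throwaway file in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'latin1.txt')
word = 'se\u00f1or'   # the word "señor"

# Write the text using Latin-1 encoding.
with open(path, 'w', encoding='latin-1') as f:
    f.write(word)

# Reading with the same encoding returns the original text.
with open(path, 'r', encoding='latin-1') as f:
    read_back = f.read()

# Reading the same bytes as UTF-8 fails, because the single byte
# which represents ñ in Latin-1 is not a valid UTF-8 sequence here.
try:
    with open(path, 'r', encoding='utf-8') as f:
        f.read()
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(read_back, decode_failed)
```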

13.13 Other modes to open a file

For reference, here is an overview of various modes to open a file in Python:

  • ’r’: This is the default mode. It opens the file for reading as text (using the default encoding unless the parameter encoding is given). Starts reading the file from the beginning. If the file is not found, the FileNotFoundError exception is raised.
  • ’rb’: Like ’r’ but opens the file for reading in binary mode.
  • ’r+’: Opens the file for reading and writing. File pointer is placed at the beginning of the file (the file pointer will be discussed in Subsection 13.23).
  • ’w’: This mode opens the file for writing text. If the file does not exist, it creates a new file. If the file exists, it truncates the file (erases all contents).
  • ’wb’: Like ’w’ but opens the file for writing in binary mode.
  • ’w+’: Like ’w’ but also allows reading from the file.
  • ’wb+’: Like ’wb’ but also allows reading from the file.
  • ’a’: Opens the file for appending. Starts writing at the end of the file. If the file does not exist, creates a new file.
  • ’ab’: Like ’a’ but in binary mode.
  • ’a+’: Like ’a’ but also allows reading from the file.
  • ’ab+’: Like ’ab’ but also allows reading from the file.
  • ’x’: Creates a new file and opens it for writing. If the file already exists, the FileExistsError exception is raised.
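The difference between the modes ’w’, ’a’ and ’x’ is easiest to see in a small experiment. The following is a sketch under our own assumptions (a throwaway file modes.txt in a temporary folder), not a complete tour of all modes.

```python
import os
import tempfile

# Assumption: a throwaway file in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'modes.txt')

with open(path, 'w') as f:      # 'w' creates the file
    f.write('one\n')
with open(path, 'a') as f:      # 'a' writes at the end of the file
    f.write('two\n')
with open(path, 'r') as f:
    after_append = f.read()     # both lines are present

with open(path, 'w') as f:      # 'w' erases the existing contents
    f.write('three\n')
with open(path, 'r') as f:
    after_write = f.read()      # only the new line is left

# 'x' refuses to open a file which already exists.
try:
    with open(path, 'x') as f:
        f.write('four\n')
    x_failed = False
except FileExistsError:
    x_failed = True

print(repr(after_append), repr(after_write), x_failed)
```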

13.14 Reading the file line by line using the for loop

Great, now let’s do something with our sample file "logo.txt"!

To begin with, we will just display its contents. Python has a convenient way of reading an open text file one line at a time via the for loop. Adjust your sample Python file "test.py" as follows:

  with open(’logo.txt’, ’r’) as f:
      for line in f:
          print(line)

Here line is not a keyword – it is a name for the text string representing the next line in the file f, and we could use some other name if we wanted.

But when you run the code, you will see this strange output:

                    #            #
  
                    #            #
  
  ######## ######## #   ######## ########
  
  #      # #        #          # #      #
  
  #      # #        #   ######## #      #
  
  #      # #        #   #      # #      #
  
  #      # ######## ### ######## ########

Well, that definitely does not look right! We will look into this mystery in the following subsection. But first let’s make sure that indeed every line extracted from the file f is a text string:

  with open(’logo.txt’, ’r’) as f:
      for line in f:
          print(type(line))

  <class ’str’>
  <class ’str’>
  <class ’str’>
  <class ’str’>
  <class ’str’>
  <class ’str’>
  <class ’str’>

13.15 The mystery of empty lines

To solve the mystery of empty lines from the previous subsection, let’s use the built-in function repr that you know from Subsection 4.10:

  with open(’logo.txt’, ’r’) as f:
      for line in f:
          print(repr(line))

  ’                  #            #\n’
  ’                  #            #\n’
  ’######## ######## #   ######## ########\n’
  ’#      # #        #          # #      #\n’
  ’#      # #        #   ######## #      #\n’
  ’#      # #        #   #      # #      #\n’
  ’#      # ######## ### ######## ########’

Oh, now this makes more sense! Each line extracted from the file (with the exception of the last one) contains the newline character \n at the end, and the print function adds one more by default. As a result, the newline characters are doubled, which causes the empty lines to appear.

This problem can be cured in two different ways. First, we can prevent the print function from adding the newline character \n after each line:

  with open(’logo.txt’, ’r’) as f:
      for line in f:
          print(line, end=’’)

                    #            #
                    #            #
  ######## ######## #   ######## ########
  #      # #        #          # #      #
  #      # #        #   ######## #      #
  #      # #        #   #      # #      #
  #      # ######## ### ######## ########

But this does not remove the newline characters from the text strings, which might cause other problems. So let’s remove them using the string method rstrip from Subsection 4.11:

  with open(’logo.txt’, ’r’) as f:
      for line in f:
          line = line.rstrip()
          print(line)

                    #            #
                    #            #
  ######## ######## #   ######## ########
  #      # #        #          # #      #
  #      # #        #   ######## #      #
  #      # #        #   #      # #      #
  #      # ######## ### ######## ########

Note that we cannot use the string method strip here because some of the lines contain empty spaces on the left which we need to remain there.
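The three ways of stripping a line can be compared directly on one sample string. The sketch below uses a made-up line (our assumption) with leading spaces, trailing spaces and a newline character. Note that rstrip without an argument removes all trailing whitespace, while rstrip(’\n’) removes only the newline.

```python
# Assumption: a made-up line with leading spaces, trailing spaces and a newline.
line = '    #    #  \n'

no_newline = line.rstrip('\n')   # removes only the newline character
right_clean = line.rstrip()      # removes all whitespace on the right
both_clean = line.strip()        # also removes the spaces on the left

print(repr(no_newline))
print(repr(right_clean))
print(repr(both_clean))
```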

13.16 Reading individual lines – method readline

The file class has a method readline which only reads one line from the file. This makes reading from files more flexible. For example, one might want to only read the first three lines:

  with open(’logo.txt’, ’r’) as f:
      for i in range(3):
          line = f.readline()
          line = line.rstrip()
          print(line)

                    #            #
                    #            #
  ######## ######## #   ######## ########

Or, the file might have a two-line header that needs to be skipped:

  % Created by Some Awesome Software
  % Date: January 1, 1111
                    #            #
                    #            #
  ######## ######## #   ######## ########
  #      # #        #          # #      #
  #      # #        #   ######## #      #
  #      # #        #   #      # #      #
  #      # ######## ### ######## ########

To skip the header, one just needs to call readline two times before using the for loop:

  with open(’logo.txt’, ’r’) as f:
      dummy = f.readline()
      dummy = f.readline()
      for line in f:
          line = line.rstrip()
          print(line)

                    #            #
                    #            #
  ######## ######## #   ######## ########
  #      # #        #          # #      #
  #      # #        #   ######## #      #
  #      # #        #   #      # #      #
  #      # ######## ### ######## ########

When the end of file is reached and there is no next line to read, readline returns an empty string:

  with open(’logo.txt’, ’r’) as f:
      for line in f:
          pass
      line = f.readline()
      print(repr(line))

  ’’

Finally, let’s mention that one can call the method as f.readline(n) where n is the maximum number of characters to be read from the current line of the file f. If the line is longer than n characters, the next call continues on the same line where the previous call stopped, so this does not skip to the beginning of the next line.
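The effect of the optional size argument of readline can be checked with a short experiment. This is a sketch only; the two-line file written here is our assumption and merely resembles the file "ages.txt" used later in this section.

```python
import os
import tempfile

# Assumption: a throwaway two-line file in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'people.txt')
with open(path, 'w') as f:
    f.write('Nathan 12\nAlyssa 16\n')

with open(path, 'r') as f:
    part1 = f.readline(4)   # at most 4 characters of the first line
    part2 = f.readline(4)   # continues on the same line
    part3 = f.readline(4)   # stops early at the end of the line
    part4 = f.readline(4)   # only now the second line begins

print(repr(part1), repr(part2), repr(part3), repr(part4))
```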

13.17 Reading the file line by line using the while loop

The method readline from the previous subsection can be combined with the while loop as well:

  with open(’logo.txt’, ’r’) as f:
      while True:
          line = f.readline()
          if not line:
              break
          line = line.rstrip()
          print(line)

                    #            #
                    #            #
  ######## ######## #   ######## ########
  #      # #        #          # #      #
  #      # #        #   ######## #      #
  #      # #        #   #      # #      #
  #      # ######## ### ######## ########

Notice how if not line was used to detect that end of file was reached. This works because when the end of file is reached, line is an empty string, and if not applied to an empty string yields True (see Subsection 10.6).

13.18 Reading the file using next

A file object is an iterator. Therefore it has the method __next__ and it can also be parsed with the built-in function next (see Subsections 9.13 - 9.16). Both can be used analogously to readline to access individual lines or parse the file line-by-line. This time we will enumerate the lines in the file "logo.txt".

First let’s do this using the method __next__:

  with open(’logo.txt’, ’r’) as f:
      for i in range(7):
          print(’Line’, str(i+1) + ’:  ’ + f.__next__().rstrip())

  Line 1:                    #            #
  Line 2:                    #            #
  Line 3:  ######## ######## #   ######## ########
  Line 4:  #      # #        #          # #      #
  Line 5:  #      # #        #   ######## #      #
  Line 6:  #      # #        #   #      # #      #
  Line 7:  #      # ######## ### ######## ########

Alternatively, one can use next:

  with open(’logo.txt’, ’r’) as f:
      for i in range(7):
          print(’Line’, str(i+1) + ’:  ’ + next(f).rstrip())

  Line 1:                    #            #
  Line 2:                    #            #
  Line 3:  ######## ######## #   ######## ########
  Line 4:  #      # #        #          # #      #
  Line 5:  #      # #        #   ######## #      #
  Line 6:  #      # #        #   #      # #      #
  Line 7:  #      # ######## ### ######## ########
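Both examples above rely on knowing that the file has exactly 7 lines; calling next once more at the end of the file would raise StopIteration. When the number of lines is not known, one possible alternative is sketched below. It uses the built-in function enumerate to number the lines, and the optional second argument of next to obtain a default value instead of an exception. The three-line sample file is our assumption.

```python
import os
import tempfile

# Assumption: a throwaway three-line file in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), 'lines.txt')
with open(path, 'w') as f:
    f.write('first\nsecond\nthird')

numbered = []
with open(path, 'r') as f:
    # enumerate numbers the lines without knowing how many there are.
    for number, line in enumerate(f, start=1):
        numbered.append('Line ' + str(number) + ': ' + line.rstrip())
    # The file is exhausted now; next returns the default value None
    # instead of raising StopIteration.
    leftover = next(f, None)

for item in numbered:
    print(item)
```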

13.19 Reading the file as a list of lines – method readlines

The file class has a method readlines which makes it possible to read the entire file at once, and returns a list of lines (as text strings):

  with open(’logo.txt’, ’r’) as f:
      L = f.readlines()
      print(L)

    [’                  #            #\n’,
     ’                  #            #\n’,
     ’######## ######## #   ######## ########\n’,
     ’#      # #        #          # #      #\n’,
     ’#      # #        #   ######## #      #\n’,
     ’#      # #        #   #      # #      #\n’,
     ’#      # ######## ### ######## ########’]

And there is an even shorter way to do this:

  with open(’logo.txt’, ’r’) as f:
      L = list(f)
      print(L)

    [’                  #            #\n’,
     ’                  #            #\n’,
     ’######## ######## #   ######## ########\n’,
     ’#      # #        #          # #      #\n’,
     ’#      # #        #   ######## #      #\n’,
     ’#      # #        #   #      # #      #\n’,
     ’#      # ######## ### ######## ########’]

As you can see, in neither case were the newline characters removed from the ends of the text strings. But there is something more important to say about this approach, which we will do in the following subsection.

13.20 File size vs available memory considerations

By RAM (Random-Access Memory) we mean runtime memory, or in other words the memory where the computer stores data while running programs. As opposed to hard disk space which is cheap, RAM is expensive, and therefore standard computers do not have much. Usually, as of 2018, standard desktop computers or laptops can have anywhere between 1 GB and 8 GB of RAM.

The file method readlines from the previous subsection will paste the entire contents of the file into the RAM, without worrying about the file size. So, one has to be careful here. If the size of the file is several GB, then there is a real possibility of running out of memory. Therefore, always be aware of the file size when working with files, and only use methods such as readlines when you are sure that the file is small.

Importantly, reading files with the for loop does not have this problem and it is safe even for very large files.
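As an illustration of both points, the sketch below first asks for the file size with os.path.getsize and then computes simple statistics with a for loop, keeping only one line in memory at a time. The sample file and the variable names are our assumptions.

```python
import os
import tempfile

# Assumption: a small throwaway file; the same loop works for huge files.
path = os.path.join(tempfile.mkdtemp(), 'big.txt')
with open(path, 'w') as f:
    f.write('short\nmuch longer line\nmid\n')

# Check the file size (in bytes) before deciding how to read the file.
size_in_bytes = os.path.getsize(path)

count = 0      # number of lines
longest = 0    # length of the longest line
with open(path, 'r') as f:
    for line in f:
        count += 1
        longest = max(longest, len(line.rstrip('\n')))

print(size_in_bytes, count, longest)
```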

13.21 Reading the file as a single text string – method read

The file class has a method read which returns the entire contents of the text file as a single text string:

  with open(’logo.txt’, ’r’) as f:
      txt = f.read()
      print(txt)

                    #            #
                    #            #
  ######## ######## #   ######## ########
  #      # #        #          # #      #
  #      # #        #   ######## #      #
  #      # #        #   #      # #      #
  #      # ######## ### ######## ########

Since the entire file is pasted into the RAM at once, this method obviously suffers from the same problems with large files as the method readlines. The method read will read the file from the current position of the file pointer until the end of the file. The method read accepts an optional integer argument which is the number of characters to be read. An example will be shown in Subsection 13.23.
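When a large file has no convenient line structure, the optional argument of read makes it possible to process the file in pieces of a fixed size. The following is a minimal sketch of this idea; the chunk size of 4 characters and the sample file are our assumptions, chosen only to keep the output short.

```python
import os
import tempfile

# Assumption: a throwaway file with ten characters.
path = os.path.join(tempfile.mkdtemp(), 'chunks.txt')
with open(path, 'w') as f:
    f.write('abcdefghij')

chunks = []
with open(path, 'r') as f:
    while True:
        chunk = f.read(4)    # read at most 4 characters
        if not chunk:        # an empty string means end of file
            break
        chunks.append(chunk)

print(chunks)
```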

13.22 Rewinding a file

Sometimes one needs to go twice through the contents of a file. To illustrate this, let’s create a new text file "ages.txt" in the folder "sandbox":


  Nathan 12
  Alyssa 16
  Zoe     9
  Peter  13
  Kerry  14
  Hunter 16
  Angie  15

Fig. 57: Text file "ages.txt" containing names and ages of people.


This sample file is very small, and therefore it can be pasted to RAM. But imagine a real-life scenario where the file size is several GB, so you only can read it one line at a time. Your task is to find the person (or multiple persons) with the highest age, and return their name(s) as a list.

The easiest solution has three steps:

  1. Pass through the file f once to figure out the highest age ha.
  2. Rewind the file f. This can be done by calling f.seek(0).
  3. Make a second pass and collect all people whose age matches ha.

Typing f.seek(0) will move the file pointer to the initial position at the beginning of the file f. We will talk about the file pointer and this method in more detail in the next subsection. And here is the corresponding code:

  with open(’ages.txt’, ’r’) as f:
      # First get the highest age ’ha’:
      ha = 0
      for line in f:
          line = line.rstrip()
          name, age = line.split()
          age = int(age)
          if age > ha:
              ha = age
      # Rewind the file:
      f.seek(0)
      # Next collect all people whose age matches ’ha’:
      L = []
      for line in f:
          line = line.rstrip()
          name, age = line.split()
          age = int(age)
          if age == ha:
              L.append(name)
  
  print(’Highest age is’, ha)
  print(’People:’, L)

  Highest age is 16
  People: [’Alyssa’, ’Hunter’]

13.23 File pointer, and methods read, tell and seek

On the most basic level, a text file is a sequence of characters, and in some cases one needs to work with this sequence on a character-by-character basis. The methods seek, read and tell serve this purpose. But before we show how they are used, let’s introduce the file pointer.

File pointer

The file pointer is an integer number which corresponds to the current position in the text file. When the file is opened for reading, the pointer is automatically set to 0. Then it increases by one with each new character read (or skipped).

Method read

From Subsection 13.21 you know that the method read accepts an optional integer argument which is the number of characters to be read. So, after calling f.read(6) the pointer will increase to 6. After calling f.read(3) once more, it will increase to 9, etc.

Method tell

The position of the pointer in an open file f can be obtained by typing f.tell() (some exceptions apply when the file is used as an iterator – see Subsection 13.24). In the following example, we will read the first line from the file "ages.txt" from Subsection 13.22,

  Nathan 12

and look at the positions of the file pointer. More precisely, we will open the file and read the first 6 characters, then 3 more, and finally the newline character \n at the end of the line:

  with open(’ages.txt’, ’r’) as f:
      pos = f.tell()
      print(pos)
      name = f.read(6)
      print(repr(name))
      pos = f.tell()
      print(pos)
      age = f.read(3)
      print(repr(age))
      pos = f.tell()
      print(pos)
      c = f.read(1)
      print(repr(c))
      pos = f.tell()
      print(pos)

  0
  ’Nathan’
  6
  ’ 12’
  9
  ’\n’
  10

Method seek

You already know the method seek from Subsection 13.22. More precisely, you know that f.seek(0) will reset the position of the file pointer to 0, which effectively rewinds the file to the beginning. But this method has more uses:

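  • Typing f.seek(n) which is the same as f.seek(n, 0) will set the pointer in the file f to position n (counted from the beginning of the file).
  • Typing f.seek(n, 1) will move the pointer by n positions relative to the current position. The number n can be negative, which moves the file pointer backward.
  • Typing f.seek(n, 2) will set the pointer relative to the end of the file. Here n is zero or negative; for example, f.seek(-3, 2) places the pointer 3 positions before the end of the file.

Be aware that in standard CPython 3, a file opened in text mode only accepts f.seek(n, 0) with positions obtained from f.tell(), plus f.seek(0, 1) and f.seek(0, 2). Nonzero offsets relative to the current position or to the end of the file raise io.UnsupportedOperation there, and require opening the file in binary mode (’rb’). The outputs shown below come from the environment used in this text.

For illustration, let’s improve the previous example and use f.seek(1, 1) to skip the empty character between the words "Nathan" and "12":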

  with open(’ages.txt’, ’r’) as f:
      pos = f.tell()
      print(pos)
      name = f.read(6)
      print(repr(name))
      pos = f.tell()
      print(pos)
      f.seek(1, 1)
      age = f.read(2)
      print(repr(age))
      pos = f.tell()
      print(pos)
      c = f.read(1)
      print(repr(c))
      pos = f.tell()
      print(pos)

  0
  ’Nathan’
  6
  ’12’
  9
  ’\n’
  10

As a last example, let’s illustrate using seek with a negative offset to move the file pointer backward. We will read the first 9 characters in the file "ages.txt", then move the pointer back by 2 positions, and read two characters again:

  with open(’ages.txt’, ’r’) as f:
      line = f.read(9)
      print(repr(line))
      f.seek(-2, 1)
      age = f.read(2)
      print(repr(age))

  ’Nathan 12’
  ’12’

This is enough for the moment, and we will return to the method seek again in Subsection 13.28 in the context of writing to files.
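For readers who work in standard CPython 3, here is a sketch of the same ideas with the file opened in binary mode, where relative seeks are allowed. It is an illustration under our own assumptions: the two-line sample file is written by the code itself, and the bytes are decoded as ASCII to obtain text strings. The last part shows that a nonzero relative seek in text mode raises io.UnsupportedOperation in CPython.

```python
import io
import os
import tempfile

# Assumption: a throwaway two-line file written in binary mode.
path = os.path.join(tempfile.mkdtemp(), 'people.txt')
with open(path, 'wb') as f:
    f.write(b'Nathan 12\nAlyssa 16\n')

with open(path, 'rb') as f:
    name = f.read(6).decode('ascii')        # first 6 bytes
    f.seek(1, 1)                            # skip the space
    age = f.read(2).decode('ascii')         # the two digits
    f.seek(-2, 1)                           # move back by 2 positions
    age_again = f.read(2).decode('ascii')   # the same two digits again
    f.seek(-3, 2)                           # 3 positions before the end
    tail = f.read().decode('ascii')         # the rest of the file

# In text mode, CPython rejects nonzero relative seeks.
text_mode_failed = False
with open(path, 'r') as f:
    f.read(6)
    try:
        f.seek(1, 1)
    except io.UnsupportedOperation:
        text_mode_failed = True

print(repr(name), repr(age), repr(age_again), repr(tail), text_mode_failed)
```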

13.24 Using tell while iterating through a file

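The method tell is disabled when the file is used as an iterator (parsed using a for loop, using the method __next__, or using the built-in function next). In standard CPython 3, calling tell in this situation raises OSError with the message "telling position disabled by next() call". In the environment used in this text the call succeeds, but it behaves in an unexpected way: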

  with open(’ages.txt’, ’r’) as f:
      for line in f:
          print(f.tell())

  69
  69
  69
  69
  69
  69
  69

The reason is that a read-ahead buffer is used to increase efficiency. As a result, the file pointer advances in large steps across the file as one iterates over lines.

One may try to circumvent this problem by avoiding the for loop, but unfortunately also using readline disables tell:

  with open(’ages.txt’, ’r’) as f:
      while True:
          print(f.tell())
          line = f.readline()
          if not line:
              break

  0
  69
  69
  69
  69
  69
  69
  69

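In short, trying to combine tell with optimized higher-level file operations is pretty much hopeless. (In standard CPython 3 the method readline does not disable tell, so the workaround below is needed only in environments which behave as shown above.) If you need to solve a task like this, the easiest way out is to define your own function myreadline which is less efficient than the built-in method readline but does not disable tell: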

  def myreadline(f):
      """
      Analogous to readline but does not disable tell.
      """
      line = ’’
      while True:
          c = f.read(1)
          if c == ’\n’ or c == ’’:
              return line
          line += c
  
  with open(’ages.txt’, ’r’) as f:
      while True:
          print(f.tell())
          line = myreadline(f)
          if not line:
              break
          print(line)

  0
  Nathan 12
  10
  Alyssa 16
  20
  Zoe     9
  30
  Peter  13
  40
  Kerry  14
  50
  Hunter 16
  60
  Angie  15
  69

13.25 Another sample task for tell

Imagine that your next task is to look for a particular word (name, number, ...) in the file, and return the list of all lines where the word is present. But the result should not be a list of text strings – instead, it should be a list of pointer positions which correspond to the beginning of each line. To be concrete, let’s say that in the file "ages.txt" we must find the beginnings of all lines which contain the character ’1’.

The best way to solve this task is to remember the position of the pointer before reading each line. But as you know from the previous subsection, this will be tricky because you will not be able to use the built-in method readline. Again, the best solution is to use your own function myreadline (which was defined in the previous subsection). Here is the main program, which for clarity prints each matching line below its position:

  s = ’1’
  with open(’ages.txt’, ’r’) as f:
      while True:
          n = f.tell()
          line = myreadline(f)
          if not line:
              break
          if s in line:
              print(n)
              print(line)

  0
  Nathan 12
  10
  Alyssa 16
  30
  Peter  13
  40
  Kerry  14
  50
  Hunter 16
  60
  Angie  15
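Since the task asks for a list of positions, the same idea can be sketched so that the positions are collected in a list instead of printed. The version below assumes standard CPython 3, where the built-in method readline does not disable tell, so myreadline is not needed there. The four-line sample file is our assumption and mimics the beginning of "ages.txt".

```python
import os
import tempfile

# Assumption: a throwaway file which mimics the first lines of "ages.txt".
path = os.path.join(tempfile.mkdtemp(), 'ages_sample.txt')
with open(path, 'wb') as f:
    f.write(b'Nathan 12\nAlyssa 16\nZoe     9\nPeter  13\n')

s = '1'
positions = []
with open(path, 'r', encoding='utf-8') as f:
    while True:
        n = f.tell()            # position before reading the line
        line = f.readline()
        if not line:
            break
        if s in line:
            positions.append(n)

    # A stored position can be used later to jump back to that line.
    f.seek(positions[1])
    second_match = f.readline().rstrip()

print(positions)
print(second_match)
```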

13.26 Writing text to a file – method write

In Subsection 13.13 you have seen various modes which can be used to open a file for reading, writing or appending. Mode ’w’ will create a new text file if a file with the given name does not exist. If the file exists, it will be opened and truncated (its contents are erased and the file pointer is set to the beginning).

When writing to a file, it matters where the Python source file is located, because the file will be created in the current directory (this is the same as when opening a file for reading – see Subsection 13.5). Of course one can use an absolute path but in our case this is not needed.

Let’s replace the code in the file "test.py" in folder "sandbox" with

  with open(’greeting.txt’, ’w’) as f:
      f.write(’Hello!’)

After running the program, a new file named "greeting.txt" will appear in that folder:


Fig. 58: Newly created file "greeting.txt".


Note that the write method does not add the newline character \n at the end of the text string. The code

  with open(’greeting.txt’, ’w’) as f:
      f.write(’Hello!’)
      f.write(’Hi there!’)
      f.write(’How are you?’)

will result in a one-line text file:


Fig. 59: The write method does not add the newline character at the end of the text string.


If we want to have multiple lines, we have to insert newline characters manually:

  with open(’greeting.txt’, ’w’) as f:
      f.write(’Hello!\n’)
      f.write(’Hi there!\n’)
      f.write(’How are you?’)

The result is now what one would expect:


Fig. 60: Newline characters must be inserted in the text strings manually.


13.27 Writing a list of lines to a file – method writelines

Staying with the previous example, the three lines can be written into the file elegantly using a list and the method writelines. This method writes a list of text strings into the given file:

  L = [’Hello!’, ’Hi there!’, ’How are you?’]
  with open(’greeting.txt’, ’w’) as f:
      f.writelines(L)

A bit disappointingly, newline characters are still not added at the end of the lines:


Fig. 61: Newline characters are not added automatically.


But after adding the newline characters manually,

  L = [’Hello!\n’, ’Hi there!\n’, ’How are you?’]
  with open(’greeting.txt’, ’w’) as f:
      f.writelines(L)

one obtains the desired result:


Fig. 62: Adding newline characters produces the desired result.
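If the list contains lines without newline characters, they do not have to be edited by hand. Two possible ways to add the newline characters while writing are sketched below; the file names and the temporary folder are our assumptions. The first passes a generator expression to writelines (which ends the last line with a newline as well), and the second joins the lines with the string method join and uses write.

```python
import os
import tempfile

L = ['Hello!', 'Hi there!', 'How are you?']

# Assumption: two throwaway files in a temporary folder.
folder = tempfile.mkdtemp()
path_a = os.path.join(folder, 'greeting_a.txt')
path_b = os.path.join(folder, 'greeting_b.txt')

# Variant 1: add a newline to every line while writing.
with open(path_a, 'w') as f:
    f.writelines(line + '\n' for line in L)

# Variant 2: put newlines only between the lines.
with open(path_b, 'w') as f:
    f.write('\n'.join(L))

with open(path_a, 'r') as f:
    result_a = f.read()
with open(path_b, 'r') as f:
    result_b = f.read()

print(repr(result_a))
print(repr(result_b))
```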


13.28 Writing to the middle of a file

In Subsection 13.23 we used the method seek to read from various parts of the file. But what happens if we want to write to some place in the middle of an open file? Will the new text be inserted or overwrite the text that is already there? We will show what will happen. Let’s begin with introducing a sample file "news.txt":


Fig. 63: Sample file "news.txt".


The name "Marconi" starts at position 51. Let’s open the file in the r+ mode, move the pointer to this position, and write there Marconi’s first name, "Guglielmo ":

  with open(’news.txt’, ’r+’) as f:
      f.seek(51, 0)
      f.write(’Guglielmo ’)

However, when the code is run, the contents of the file "news.txt" starting at position 51 is overwritten with the new text:


Fig. 64: Text beginning at position 51 was overwritten with the new text.


So this did not work. As a matter of fact, the text is a sequence of zeros and ones on the disk. It behaves like a sentence written with a pencil on paper. You can erase and overwrite part of it, but inserting new text in the middle is not possible. We will have to read the whole file as a text string, slice it, insert the first name in the middle, and then write the new text string back to the original file:

  with open('news.txt', 'r') as f:
      txt = f.read()
  with open('news.txt', 'w') as f:
      f.write(txt[:51] + 'Guglielmo ' + txt[51:])

Finally, the desired result:


PIC


Fig. 65: Inserting new text required reading the whole file as a text string, slicing it, and then writing the result back into the original file.
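
The read-slice-write approach can be wrapped into a small reusable function. The sketch below is an illustration only: the function name insert_text, the file name "sample.txt" and its contents are hypothetical and do not come from the example above.

```python
def insert_text(filename, pos, text):
    """Insert text at character position pos of a text file."""
    # Read the whole file as one text string
    with open(filename, 'r') as f:
        txt = f.read()
    # Write back the part before pos, the new text, and the rest
    with open(filename, 'w') as f:
        f.write(txt[:pos] + text + txt[pos:])

# Create a small sample file (hypothetical contents)
with open('sample.txt', 'w') as f:
    f.write('Hello world!')

insert_text('sample.txt', 6, 'big ')

with open('sample.txt', 'r') as f:
    result = f.read()
print(result)
```

Keep in mind that the whole file is loaded into memory, so this approach is suitable only for files of moderate size. Also, for plain ASCII text the character positions used for slicing coincide with the byte positions used by seek, but with multi-byte UTF-8 characters they may differ.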


13.29 Things can go wrong: using exceptions

A number of things can fail when working with files. When reading files:

  • The file you want to open may not exist.
  • You may not have sufficient privileges to open the file.
  • The file may be corrupted.
  • You may run out of memory while reading the file.

When writing to files:

  • The folder where you want to create the new file may not exist or be read-only.
  • You may not have sufficient privileges to write to the folder.
  • You may run out of disk space while writing data to a file, for example when an infinite loop keeps writing by mistake.

The most frequent exceptions related to working with files are:

  • FileExistsError is raised when trying to create a file or directory which already exists.
  • FileNotFoundError is raised when a file or directory is requested but doesn’t exist.
  • IsADirectoryError is raised when a file operation is requested on a directory.
  • PermissionError is raised when trying to run an operation without the adequate access rights.
  • UnicodeError is raised when a Unicode-related encoding or decoding error occurs.
  • OSError is raised when a system function returns a system-related error, including I/O failures such as "file not found" or "disk full".
  • MemoryError is raised when an operation runs out of memory.
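
Several of these exceptions are related: FileExistsError, FileNotFoundError, IsADirectoryError and PermissionError are subclasses of OSError. This can be checked with the built-in function issubclass, as the following short sketch illustrates (the list name file_errors is our own choice):

```python
file_errors = [FileExistsError, FileNotFoundError,
               IsADirectoryError, PermissionError]

# Each of these exceptions is a special case of OSError
for e in file_errors:
    print(e.__name__, issubclass(e, OSError))

# UnicodeError and MemoryError are not subclasses of OSError
print(issubclass(UnicodeError, OSError))
print(issubclass(MemoryError, OSError))
```

Consequently, an except OSError branch also catches the four more specific exceptions, so the more specific branches should be listed before it.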

It is always a good idea to put the code that works with the file into a try branch. (The try-except statement including its full form try-except-else-finally was discussed in Section 12.) An example:

  try:
      with open('myfile.txt', 'r') as f:
          txt = f.read()
  except FileNotFoundError:
      print("File myfile.txt was not found.")
      # ... take a corrective action ...
  except MemoryError:
      print("Ran out of memory while reading file myfile.txt.")
      # ... take a corrective action ...

Recall from Subsection 12.8 that an optional else block can be added for code to be executed if no exceptions were raised, and a finally block for code to be executed always, no matter what happens in the previous blocks.
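
A possible full form of the statement might look as follows. This is a sketch under the assumption that the file "no_such_file_13_29.txt" (a hypothetical name) does not exist, so the except branch is the one that runs; the variable status is introduced only to make the outcome visible.

```python
filename = 'no_such_file_13_29.txt'   # hypothetical, assumed not to exist
try:
    f = open(filename, 'r')
except FileNotFoundError:
    status = 'not found'
    print('File', filename, 'was not found.')
else:
    # Executed only if the file was opened successfully
    with f:
        txt = f.read()
    status = 'read'
    print('Read', len(txt), 'characters.')
finally:
    # Executed always, whether or not an exception was raised
    print('Done.')
```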

13.30 Binary files I – checking byte order

Binary files are suitable for storing binary streams (sequences of 0s and 1s) which do not necessarily represent text, such as images or executable files. Before reading from or writing to a binary file, one needs to check the native byte order of the host platform. It can be either little-endian (the bytes of a multi-byte value are ordered from the little end = least-significant byte first) or big-endian (the bytes are ordered from the big end = most-significant byte first).

Big-endian is the most common format in data networking; fields in the protocols of the Internet protocol suite, such as IPv4, IPv6, TCP, and UDP, are transmitted in big-endian order. For this reason, big-endian byte order is also referred to as network byte order. Little-endian storage is popular for microprocessors, in part due to significant influence on microprocessor designs by Intel Corporation. We will not go into more detail here but you can easily find more information online.

To check the byte order, import the sys module and print sys.byteorder:

  import sys
  print(sys.byteorder)

  little

Hence in our case, the byte order is little-endian.
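
To see what the two byte orders mean in practice, one can convert the same integer into bytes both ways. The sketch below uses the method to_bytes of class int; the sample value 345 is an arbitrary choice.

```python
value = 345   # 0x0159 in hexadecimal

little = value.to_bytes(2, byteorder='little')
big = value.to_bytes(2, byteorder='big')

print(little.hex())   # least-significant byte first
print(big.hex())      # most-significant byte first
```

The output is 5901 for little-endian and 0159 for big-endian: the same two bytes, stored in opposite order.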

13.31 Binary files II – writing unsigned integers

Data written to binary files, even raw binary sequences, can always be represented as unsigned (= non-negative) integers. For instance, it can be 8-bit (1-Byte) segments. Or 2-Byte segments. Or entries consisting of one 2-Byte and one 4-Byte segment. Let’s say that our data entries consist of a 2-Byte id and a 4-Byte value (the last scenario). To packetize the data into binary format, one needs to import the struct module:

  import struct

This module performs conversions between Python values and C structs represented as Python bytes objects. It is described at the Python Standard Library documentation page https://docs.python.org/3/library/struct.html.

Next, let’s open a binary file "datafile.bin" for writing, packetize a sample data pair id = 345, value = 6789 and write it to the file. The file will be automatically closed at the end of the with statement:

  import struct
  id = 345
  value = 6789
  with open('datafile.bin', 'wb') as f:
      data = struct.pack('<HI', id, value)
      f.write(data)

Here, ’<HI’ is a formatting string where < means little-endian byte order (> would be used for big-endian), H means a 2-Byte unsigned integer, and I means a 4-Byte unsigned integer. The following table summarizes the most widely used unsigned data formats:




  Symbol   Python data type   Length in Bytes
  ------   ----------------   ---------------
  B        unsigned integer   1
  H        unsigned integer   2
  I        unsigned integer   4
  Q        unsigned integer   8



For a complete list of symbols see the above URL.
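
The lengths of the individual formats can be verified with the function calcsize of the struct module. The following sketch also packs the sample pair id = 345, value = 6789 without writing it to a file, just to show the resulting bytes:

```python
import struct

for symbol in 'BHIQ':
    # The prefix < selects little-endian byte order with standard sizes
    print(symbol, struct.calcsize('<' + symbol))

data = struct.pack('<HI', 345, 6789)
print(len(data))     # 2 Bytes for H plus 4 Bytes for I
print(data.hex())    # the packed bytes in hexadecimal notation
```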

Writing to and reading from binary files can be simplified using the Pickle module – this will be discussed in Subsection ??.

13.32 Binary files III – writing bit sequences

In the previous subsection we saw how to write to binary files unsigned integers of various lengths (8, 16, 32 and 64 bits). Now, let’s say that we want to write 8-bit binary sequences. These may be parts of a long binary sequence coming from an image, and not represent numbers at all. But for the sake of writing to a binary file, we will convert them into unsigned integers. This is easy. For instance, the bit sequence ’11111111’ represents unsigned decimal integer 255:

  binseq = '11111111'
  numrep = int(binseq, 2)
  print(numrep)

  255

The value 2 in int(binseq, 2) stands for base-2 (binary) number. Using the code from the previous subsection, an 8-bit binary sequence binseq can be written to a binary file "datafile.bin" as follows:

  import struct
  binseq = '11111111'
  with open('datafile.bin', 'wb') as f:
      data = struct.pack('<B', int(binseq, 2))
      f.write(data)

Here < means little-endian byte order and B stands for 1-Byte (8-bit) unsigned integer.

As another example, let’s write a 32-bit sequence 11011110110001110110001110011011:

  import struct
  binseq = '11011110110001110110001110011011'
  with open('datafile.bin', 'wb') as f:
      data = struct.pack('<I', int(binseq, 2))
      f.write(data)

Here I stands for 4-Byte (32-bit) unsigned integer. On the way, the binary sequence was converted into an integer value 3737609115 (which really does not matter).

Finally, the binary sequence may be coming not as a text string but directly as a base-2 integer, such as val=0b11011110110001110110001110011011. In this case, the conversion step using int is not needed, and one can save the number to the file directly:

  import struct
  val = 0b11011110110001110110001110011011
  with open('datafile.bin', 'wb') as f:
      data = struct.pack('<I', val)
      f.write(data)

Obviously, a for loop can be used to write as many 8-bit, 16-bit, 32-bit or 64-bit sequences as needed.
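
Such a loop might look as follows. This is a minimal sketch: the three 8-bit sequences and the file name "sequences.bin" are hypothetical sample data.

```python
import struct

# Hypothetical sample data: three 8-bit sequences
sequences = ['11111111', '00000001', '10101010']

with open('sequences.bin', 'wb') as f:
    for binseq in sequences:
        # Convert each sequence to an unsigned integer and write 1 Byte
        f.write(struct.pack('<B', int(binseq, 2)))

# Read the file back; iterating over a byte string yields integers
with open('sequences.bin', 'rb') as f:
    data = f.read()
print(list(data))
```

The printed list is [255, 1, 170], one unsigned integer per written Byte.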

13.33 Binary files IV – reading byte data

Let’s get back to Subsection 13.31 where we created a binary file "datafile.bin" and wrote two unsigned integers id = 345 (2 Bytes) and value = 6789 (4 Bytes) to it. The two unsigned integers can be retrieved from the file by opening it for binary reading, reading the 2-Byte and 4-Byte byte strings, and converting them into unsigned integers as follows:

  with open('datafile.bin', 'rb') as f:
      data = f.read(2)
      id = int.from_bytes(data, byteorder='little', signed=False)
      print(id)
      data = f.read(4)
      value = int.from_bytes(data, byteorder='little', signed=False)
      print(value)

  345
  6789

Note: We pass byteorder=’little’ because the data was written in little-endian order (the < prefix in Subsection 13.31), which is also the native byte order of our system (see Subsection 13.30). When the file is opened for binary reading using the flag ’rb’, the method read returns a so-called byte string. Calling read(1) returns a byte string of length 1 Byte, read(2) of length 2 Bytes, etc. The byte string is not a regular text string. One needs to know what the byte string represents in order to decode it correctly. In our case, we knew that both byte strings represented unsigned integers, therefore the method from_bytes of class int could be used to decode them.
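
As an alternative to from_bytes, both values can be decoded in one step with the function unpack of the struct module, using the same formatting string that was used for writing. The sketch below first recreates the file from Subsection 13.31 so that it is self-contained:

```python
import struct

# Write the sample pair from Subsection 13.31
with open('datafile.bin', 'wb') as f:
    f.write(struct.pack('<HI', 345, 6789))

# Read all 6 Bytes and decode both values at once
with open('datafile.bin', 'rb') as f:
    data = f.read(struct.calcsize('<HI'))
id, value = struct.unpack('<HI', data)
print(id, value)
```

The function unpack always returns a tuple, here with two entries that are assigned to id and value.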

Now let’s get back to the example from Subsection 13.32 where we saved the binary sequence ’11011110110001110110001110011011’ to the file "datafile.bin". Here is how to read it back:

  with open('datafile.bin', 'rb') as f:
      data = f.read(4)
      value = int.from_bytes(data, byteorder='little', signed=False)
      print(bin(value))           # binary (base-2) integer
      print(str(bin(value))[2:])  # text string of 0s and 1s

  0b11011110110001110110001110011011
  11011110110001110110001110011011
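
One detail to keep in mind is that bin drops leading zeros, so a bit sequence that starts with 0 would come back shorter than it was written. A possible remedy, sketched below with a hypothetical 8-bit sample, is the built-in function format with a fixed width:

```python
value = int('00000001', 2)

short = bin(value)[2:]          # leading zeros are lost
padded = format(value, '08b')   # padded with zeros to 8 binary digits

print(short)
print(padded)
```

Here the first line of output is 1 while the second one is 00000001. For a 32-bit sequence the format specification would be '032b'.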



Created on August 6, 2018 in Python I,   Python II.