Introduction to Python Programming. Section 13. File Operations
13 File Operations
13.1 Objectives
In this section you will learn
- About the NCLab file system and the security of your data in NCLab.
- How to open files via the built-in function open and the with statement.
- Various modes to open files (for reading, writing, appending, etc.).
- About Latin-1 and UTF-8 encoding of text files.
- How to parse files one line at a time, and other ways to read data from a file.
- To consider file size and the amount of available runtime memory.
- How to use the file pointer and work on the character level.
- About various ways to write data to a file.
- That things can go wrong, and one should use exceptions when working with files.
In this section we will mostly work with text files. Binary files will be discussed later in Subsections 13.30 – 13.33.
13.2 Why learn about files?
So far you have learned about using various programming techniques to process data in the form of text strings, variables, lists, dictionaries, etc. But all such data, in one way or another, comes from files. Whether stored on the hard disk of your computer or on a cloud server, files are an integral part of computer programming, and every software developer must be fluent in working with them.
13.3 NCLab file system and security of your data
The file manager My Files whose icon is present on your NCLab Desktop gives you access to your own NCLab file system which works similarly to the Windows, Linux and other file systems. You can create files and folders, rename them, move them, and delete them. You can also upload files from your hard disk, flash drive, or from the web. Thanks to a generous quota you should never be worried about running out of space. On the backend, your files are stored in a Mongo distributed database system at Digital Ocean, and all data is backed up every day. The communication between your browser and the cloud servers is encrypted for maximum security.
13.4 Creating sample text file
Depending on the PDF viewer you are using to read this text, you should be able to select the
block below with your mouse, and use CTRL+C to copy it to clipboard:
Next, go to the NCLab Desktop, open My Files, and create a new folder named "sandbox". We will be using it throughout this section. Then, open the Creative Suite, and under Accessories launch Notepad. Paste the text there via CTRL+V, and you should see something like this:
Then use "Save in NCLab" in the Notepad’s File menu to save the file under the name "logo" to your newly created folder "sandbox". Notepad will automatically add an extension ".txt" to it (which will not be displayed by the file manager – like in Windows). The text "Untitled Text" in the top part of the Notepad window will change to "logo":
13.5 Creating sample Python file
So far it did not matter where your Python code was located, and it was even OK to run any program in unsaved mode in the Python app. This is different when working with files. The reason is that when opening a file, the interpreter will look for it in the current folder where your Python source code is saved.
Therefore, let’s go to the Creative Suite, launch a new Python worksheet, erase the demo code, and type there a single line f = open('logo.txt', mode='r'). Then use "Save in NCLab" in the File menu to save the file under the name "test" to the folder "sandbox" where you already have your file "logo.txt". The Python app will automatically add an extension ".py" to it which will not be displayed by the file manager. This is how it should look:
13.6 Opening a text file for reading – method open
In the previous subsection you already met the built-in function open which can be used to
open files:
The parameter name mode can be left out for brevity, so usually you will see the function
open used as follows:
The first argument ’logo.txt’ is the file name. The interpreter will look for it in the directory where your Python source code is located. Using subfolders such as ’data/logo.txt’ is OK as well (assuming that there is a subfolder ’data’ in your current directory which contains the file ’logo.txt’).
The next argument ’r’ means "for reading in text mode". Various other modes to open a file will be mentioned in Subsection 13.13.
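The function open also has an optional keyword parameter encoding. When it is omitted, Python uses the default encoding of the platform. This is often ’utf-8’ (for example on Linux and macOS) but not always, so passing encoding=’utf-8’ explicitly removes any doubt. We will talk about encoding in Subsection 13.11.
Finally, the variable f will store a file object which gives access to the file contents. As everything else in Python, the file object is an instance of a class. In the following subsections we will explain various methods of this class which can be used to work with data in files.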
13.7 Closing a file – method close
The file class has a method close to close the file. An open file named f can be closed by
typing
Nothing happens when one tries to close a file which is not open.
13.8 Checking whether a file is closed
To check whether a file f is closed, one can use the attribute closed of the file
class:
which is either True or False.
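The steps from Subsections 13.6 – 13.8 can be tried together. The following is only a minimal sketch: so that it runs outside of NCLab as well, it first creates a small stand-in for "logo.txt" in a temporary folder (the folder, the file contents and the variable names before and after are assumptions made for this illustration).

```python
import os
import tempfile

# Create a small stand-in for "logo.txt" in a temporary folder (assumption).
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'logo.txt')
f = open(path, 'w')
f.write('sample text\n')
f.close()

# Open the file for reading in text mode.
f = open(path, 'r')
before = f.closed      # False: the file is open
f.close()
after = f.closed       # True: the file is closed
f.close()              # closing an already closed file does nothing
print(before, after)
```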
13.9 Using the with statement to open files
The preferred way to open files is using the with statement:
The body of the with statement is indented analogously to loops and conditions. There is no need to use the close method – the file will be closed automatically, even if any exceptions are raised. This is a good alternative to try-finally blocks which were introduced in Subsection 12.8. More details on the usage of the with statement can be found in the PEP 343 section of the official Python documentation at https://www.python.org/dev/peps/pep-0343/.
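As a minimal sketch of the with statement, the code below writes a small stand-in file and reads it back. The temporary folder, the file contents and the names inside and outside are assumptions chosen for the illustration.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'logo.txt')

# Write a stand-in file; it is closed automatically at the end of the block.
with open(path, 'w') as f:
    f.write('first line\n')

# Read it back; again no explicit call to close is needed.
with open(path, 'r') as f:
    text = f.read()
    inside = f.closed    # False while the body is running

outside = f.closed       # True after the with statement has finished
print(repr(text), inside, outside)
```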
13.10 Getting the file name
The file class has an attribute name which can be used to obtain the filename:
13.11 A word about encoding, Latin-1, and UTF-8
In Subsections 4.26 – 4.28 we discussed the ASCII table which translates Latin text characters
and other widely used text symbols into decimal integer codes and vice versa. The
decimal integer codes are then translated into the corresponding binary numbers
(zeros and ones) before they are saved to the disk. For example, the ASCII text
string
becomes
Every 8-bit (1-Byte) segment in this sequence represents one ASCII character. For example, 01001000 is ’H’ etc.
The process of converting text into a binary sequence is called encoding. It only applies to text files, not to binary files. The above example, up to some technical details, describes Latin-1 (ISO-8859-1) encoding. This was the standard encoding for Western European languages until the advent of Unicode, whose first version was published in 1991, and of its UTF-8 encoding, which followed shortly afterwards.
Unicode – which for simplicity can be understood as "extended ASCII" – was created in order to accommodate accents (ñ, ü, ř, ç, ...), Japanese, Chinese and Arabic symbols, and various special characters coming from other languages. While the original 8-bit Latin-1 (ISO-8859-1) can only encode 256 characters, Unicode has room for 1,114,112 code points, and UTF-8 is one way to encode them. UTF-8 uses a variable number of bytes per character (one to four), trying to be as economical as possible. The exact scheme is beyond the scope of this textbook but you can find it on the web.
Importantly, UTF-8 is backward compatible with ASCII: the UTF-8 codes of ASCII characters are the same as their ASCII and Latin-1 (ISO-8859-1) codes. In other words, it does not matter whether UTF-8 or Latin-1 (ISO-8859-1) encoding is used for ASCII text. The two encodings differ only for non-ASCII characters such as ñ, which takes one byte in Latin-1 but two bytes in UTF-8.
Importantly, UTF-8 encoding is the standard in Python 3, so you don’t have to worry about using language-specific non-ASCII characters in your text strings. We will stop here for now, but feel free to learn more about encoding in the official Python documentation online.
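The encoding step can be observed directly with the string method encode. The sketch below uses the sample string 'Hello' (an assumption chosen because its first character ’H’ is the one mentioned above) and compares Latin-1 with UTF-8.

```python
text = 'Hello'

# Encode the ASCII string and show each byte as an 8-bit binary number.
bits = ' '.join(format(b, '08b') for b in text.encode('latin-1'))
print(bits)

# For ASCII text, Latin-1 and UTF-8 produce identical bytes.
same = text.encode('latin-1') == text.encode('utf-8')
print(same)

# The non-ASCII character n-with-tilde (written as an escape here)
# takes one byte in Latin-1 but two bytes in UTF-8.
n_latin1 = len('\u00f1'.encode('latin-1'))
n_utf8 = len('\u00f1'.encode('utf-8'))
print(n_latin1, n_utf8)
```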
13.12 Checking text encoding
Class file has an attribute encoding which can be used to obtain information about text
encoding:
The file class has a number of other useful attributes besides name and encoding. Feel free to learn more at https://docs.python.org/3/library/index.html.
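The attributes name and encoding can be inspected as in the sketch below. The stand-in file in a temporary folder and the variable names are assumptions, and the encoding is passed explicitly so that the result does not depend on the platform default.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'logo.txt')

# Passing the encoding explicitly avoids depending on the platform default.
with open(path, 'w', encoding='utf-8') as f:
    f.write('sample text\n')

with open(path, 'r', encoding='utf-8') as f:
    file_name = f.name          # the name that was passed to open
    file_encoding = f.encoding  # the encoding used to decode the file
print(file_name)
print(file_encoding)
```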
13.13 Other modes to open a file
For reference, here is an overview of various modes to open a file in Python:
- ’r’: This is the default mode. It opens the file for reading as text (in UTF-8 encoding). Starts reading the file from the beginning. If the file is not found, FileNotFoundError exception is raised.
- ’rb’: Like ’r’ but opens the file for reading in binary mode.
- ’r+’: Opens the file for reading and writing. File pointer is placed at the beginning of the file (the file pointer will be discussed in Subsection 13.23).
- ’w’: This mode opens the file for writing text. If the file does not exist, it creates a new file. If the file exists, it truncates the file (erases all contents).
- ’wb’: Like ’w’ but opens the file for writing in binary mode.
- ’w+’: Like ’w’ but also allows reading from the file.
- ’wb+’: Like ’wb’ but also allows reading from the file.
- ’a’: Opens the file for appending. Starts writing at the end of the file. If the file does not exist, creates a new file.
- ’ab’: Like ’a’ but in binary mode.
- ’a+’: Like ’a’ but also allows reading from the file.
- ’ab+’: Like ’ab’ but also allows reading from the file.
- ’x’: Creates a new file. If the file already exists, the operation fails.
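The difference between the modes ’w’, ’a’ and ’x’ can be checked with a short experiment. This is a sketch only: the file "modes.txt" in a temporary folder and all variable names are hypothetical.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'modes.txt')

with open(path, 'w') as f:      # creates the file
    f.write('one\n')
with open(path, 'a') as f:      # appends to the end of the file
    f.write('two\n')
with open(path, 'r') as f:
    after_append = f.read()

with open(path, 'w') as f:      # truncates the existing contents
    f.write('three\n')
with open(path, 'r') as f:
    after_truncate = f.read()

try:
    open(path, 'x')             # fails because the file already exists
    x_failed = False
except FileExistsError:
    x_failed = True

print(repr(after_append), repr(after_truncate), x_failed)
```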
13.14 Reading the file line by line using the for loop
Great, now let’s do something with our sample file "logo.txt"!
To begin with, we will just display its contents. Python has a convenient way of reading an
open text file one line at a time via the for loop. Adjust your sample Python file "test.py" as
follows:
Here line is not a keyword – it is a name for the text string representing the next line in the
file f, and we could use some other name if we wanted.
But when you run the code, you will see this strange output:
Well, that definitely does not look right! We will look into this mystery in the following
subsection. But first let’s make sure that indeed every line extracted from the file f is a text
string:
13.15 The mystery of empty lines
To solve the mystery of empty lines from the previous subsection, let’s use the built-in
function repr that you know from Subsection 4.10:
Oh, now this makes more sense! Each line extracted from the file (with the exception of the
last one) contains the newline character \n at the end, and the print function adds one more
by default. As a result, the newline characters are doubled, which causes the empty lines to
appear.
This problem can be cured in two different ways. First, we can prevent the print function
from adding the newline character \n after each line:
But this does not remove the newline characters from the text strings, which might cause
other problems. So let’s remove them using the string method rstrip from Subsection
4.11:
Note that we cannot use the string method strip here because some of the lines contain leading spaces which need to be preserved.
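The whole experiment from Subsections 13.14 and 13.15 can be sketched as follows. The three-line stand-in for "logo.txt" is an assumption (the real file is larger), and the lists raw and clean are hypothetical names used to keep the results.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'logo.txt')
with open(path, 'w') as f:
    f.write('  *\n ***\n*****')   # the last line has no newline character

raw = []      # lines exactly as they come from the file, shown with repr
clean = []    # lines with the trailing newline removed
with open(path, 'r') as f:
    for line in f:
        raw.append(repr(line))
        clean.append(line.rstrip())

for r in raw:
    print(r)          # the newline characters are visible here
for line in clean:
    print(line)       # no empty lines, leading spaces are preserved
```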
13.16 Reading individual lines – method readline
The file class has a method readline which only reads one line from the file. This makes
reading from files more flexible. For example, one might want to only read the first three
lines:
Or, the file might have a two-line header that needs to be skipped:
To skip the header, one just needs to call readline two times before using the for
loop:
When the end of file is reached and there is no next line to read, readline returns an empty
string:
Finally let’s mention that one can call the method as f.readline(n) where n is the maximum number of characters to be read from the current line of the file f. If the line is longer than n characters, the next call to readline continues where the previous one stopped, so the rest of the line is not skipped.
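The uses of readline described in this subsection can be combined into one sketch. The sample file with a two-line header is an assumption, as are the variable names.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'table.txt')
with open(path, 'w') as f:
    f.write('Name Age\n--------\nNathan 12\nAlice 9\n')

with open(path, 'r') as f:
    f.readline()                  # skip the first header line
    f.readline()                  # skip the second header line
    data = [line.rstrip() for line in f]
    at_end = f.readline()         # '' because the end of file was reached

with open(path, 'r') as f:
    part1 = f.readline(3)         # at most 3 characters of the first line
    part2 = f.readline()          # the rest of the same line

print(data, repr(at_end), repr(part1), repr(part2))
```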
13.17 Reading the file line by line using the while loop
The method readline from the previous subsection can be combined with the while loop
as well:
Notice how if not line was used to detect that end of file was reached. This works because when the end of file is reached, line is an empty string, and if not applied to an empty string yields True (see Subsection 10.6).
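A minimal sketch of this loop is shown below; the sample file and the counter count are assumptions. Note that an empty line inside the file is the string '\n', not an empty string, so it does not end the loop prematurely.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'ages.txt')
with open(path, 'w') as f:
    f.write('Nathan 12\n\nAlice 9\n')   # the second line is empty

count = 0
with open(path, 'r') as f:
    while True:
        line = f.readline()
        if not line:          # empty string: end of file was reached
            break
        count += 1
        print(line.rstrip())
print('Lines read:', count)
```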
13.18 Reading the file using next
The file object is an iterator. Therefore it has the method __next__ and it can also be parsed with the built-in function next (see Subsections 9.13 - 9.16). Both can be used analogously to readline to access individual lines or parse the file line-by-line. This time we will enumerate the lines in the file "logo.txt".
First let’s do this using the method __next__:
Alternatively, one can use next:
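One way to enumerate lines with next is sketched below. At the end of the file, next (and __next__) raise the StopIteration exception, which is used here to leave the loop. The sample file and the list numbered are assumptions.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'logo.txt')
with open(path, 'w') as f:
    f.write('  *\n ***\n*****\n')

numbered = []
n = 1
with open(path, 'r') as f:
    while True:
        try:
            line = next(f)        # equivalent to f.__next__()
        except StopIteration:     # raised at the end of the file
            break
        numbered.append(str(n) + ': ' + line.rstrip())
        n += 1

for item in numbered:
    print(item)
```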
13.19 Reading the file as a list of lines – method readlines
The file class has a method readlines which makes it possible to read the entire file at
once, and returns a list of lines (as text strings):
And there is an even shorter way to do this:
As you can see, in neither case were the text strings cleaned from the newline characters at the end. But there is something more important to say about this approach, which we will do in the following subsection.
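Both approaches can be compared in a short sketch. The shorter way is assumed here to be the built-in function list applied to the file object; the sample file is hypothetical.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'ages.txt')
with open(path, 'w') as f:
    f.write('Nathan 12\nAlice 9\nZoe 11\n')

with open(path, 'r') as f:
    lines1 = f.readlines()    # list of lines, newline characters included

with open(path, 'r') as f:
    lines2 = list(f)          # the same result, obtained by iterating over f

print(lines1)
print(lines1 == lines2)
```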
13.20 File size vs available memory considerations
By RAM (Random-Access Memory) we mean runtime memory, or in other words the memory where the computer stores data while running programs. As opposed to hard disk space which is cheap, RAM is expensive, and therefore standard computers do not have much. Usually, as of 2018, standard desktop computers or laptops can have anywhere between 1 GB and 8 GB of RAM.
The file method readlines from the previous subsection will paste the entire contents of the file into the RAM, without worrying about the file size. So, one has to be careful here. If the size of the file is several GB, then there is a real possibility of running out of memory. Therefore, always be aware of the file size when working with files, and only use methods such as readlines when you are sure that the file is small.
Importantly, reading files with the for loop does not have this problem and it is safe even for very large files.
13.21 Reading the file as a single text string – method read
The file class has a method read which returns the entire contents of the text file as a single
text string:
Since the entire file is pasted into the RAM at once, this method obviously suffers from the same problems with large files as the method readlines. The method read will read the file from the current position of the file pointer until the end of the file. The method read accepts an optional integer argument which is the number of characters to be read. An example will be shown in Subsection 13.23.
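A short sketch of both forms of read follows; the sample file and variable names are assumptions.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'ages.txt')
with open(path, 'w') as f:
    f.write('Nathan 12\nAlice 9\n')

with open(path, 'r') as f:
    whole = f.read()       # the entire file as one text string

with open(path, 'r') as f:
    first = f.read(6)      # only the first 6 characters
    rest = f.read()        # everything from the current position to the end

print(repr(whole))
print(repr(first), repr(rest))
```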
13.22 Rewinding a file
Sometimes one needs to go twice through the contents of a file. To illustrate this, let’s create a new text file "ages.txt" in the folder "sandbox":
This sample file is very small, and therefore it can be pasted to RAM. But imagine a real-life
scenario where the file size is several GB, so you only can read it one line at a time. Your task
is to find the person (or multiple persons) with the highest age, and return their name(s) as a
list.
The easiest solution has three steps:
- Pass through the file f once to figure out the highest age ha.
- Rewind the file f. This can be done by calling f.seek(0).
- Make a second pass and collect all people whose age matches ha.
Typing f.seek(0) will move the file pointer to the initial position at the beginning of the
file f. We will talk about the file pointer and this method in more detail in the next
subsection. And here is the corresponding code:
with open('ages.txt', 'r') as f:
    # First get the highest age 'ha':
    ha = 0
    for line in f:
        line = line.rstrip()
        name, age = line.split()
        age = int(age)
        if age > ha:
            ha = age
    # Rewind the file:
    f.seek(0)
    # Next collect all people whose age matches 'ha':
    L = []
    for line in f:
        line = line.rstrip()
        name, age = line.split()
        age = int(age)
        if age == ha:
            L.append(name)
print('Highest age is', ha)
print('People:', L)
13.23 File pointer, and methods read, tell and seek
On the most basic level, a text file is a sequence of characters, and in some cases one needs to work with this sequence on a character-by-character basis. The methods seek, read and tell serve this purpose. But before we show how they are used, let’s introduce the file pointer.
File pointer
The file pointer is an integer number which corresponds to the current position in the text file. When the file is opened for reading, the pointer is automatically set to 0. Then it increases by one with each new character read (or skipped).
Method read
From Subsection 13.21 you know that the method read accepts an optional integer argument which is the number of characters to be read. So, after calling f.read(6) the pointer will increase to 6. After calling once more f.read(3), it will increase to 9 etc.
Method tell
The position of the pointer in an open file f can be obtained by typing f.tell() (some exceptions apply when the file is used as an iterator – see Subsection 13.24). In the following example, we will read the first line from the file "ages.txt" from Subsection 13.22, and look at the positions of the file pointer. More precisely, we will open the file and read the first 6 characters, then 3 more, and finally the newline character \n at the end of the line:
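A sketch of this experiment is shown below. Only the first line "Nathan 12" is taken from the text; the second line, the temporary folder and the list positions are assumptions. The file is written with newline='\n' so that every line ending is a single character on all platforms.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'ages.txt')
with open(path, 'w', newline='\n') as f:
    f.write('Nathan 12\nAlice 9\n')

positions = []
with open(path, 'r') as f:
    positions.append(f.tell())   # pointer right after opening the file
    word = f.read(6)             # 'Nathan'
    positions.append(f.tell())
    number = f.read(3)           # ' 12'
    positions.append(f.tell())
    newline = f.read(1)          # '\n'
    positions.append(f.tell())

print(repr(word), repr(number), repr(newline))
print(positions)
```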
Method seek
You already know the method seek from Subsection 13.22. More precisely, you know that f.seek(0) will reset the position of the file pointer to 0, which effectively rewinds the file to the beginning. But this method has more uses:
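- Typing f.seek(n) which is the same as f.seek(n, 0) will set the pointer in the file f to position n (counted from the beginning of the file).
- Typing f.seek(n, 1) will move the pointer by n positions relative to the current position. A positive n moves the file pointer forward, a negative n moves it backward.
- Typing f.seek(n, 2) will set the pointer to position n relative to the end of the file. Here n is zero or negative, so f.seek(-3, 2) places the pointer 3 positions before the end of the file.
In Python 3, the last two forms accept a nonzero n only when the file is opened in binary mode (such as ’rb’). For files opened in text mode, only f.seek(0, 1) and f.seek(0, 2) are allowed, and other relative offsets raise the exception io.UnsupportedOperation.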
For illustration, let’s improve the previous example and use f.seek(1, 1) to skip the space character between the words "Nathan" and "12". Note that Python 3 accepts a nonzero relative offset only for files opened in binary mode (’rb’):
As a last example, let’s illustrate using seek with a negative offset to move the file pointer
backward. We will read the first 9 characters in the file "ages.txt", then move the pointer back
by 2 positions, and read two characters again:
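The relative forms of seek can be sketched as follows. The file is opened in binary mode here, because Python 3 rejects nonzero relative offsets for text-mode files; as a consequence, read returns bytes objects such as b'Nathan'. The second line of the sample file and all variable names are assumptions.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'ages.txt')
with open(path, 'wb') as f:
    f.write(b'Nathan 12\nAlice 9\n')

with open(path, 'rb') as f:
    name = f.read(6)      # b'Nathan'
    f.seek(1, 1)          # skip the space character
    age = f.read(2)       # b'12'

    f.seek(0)             # rewind to the beginning
    f.read(9)             # read b'Nathan 12'
    f.seek(-2, 1)         # move the pointer back by 2 positions
    again = f.read(2)     # b'12' once more

    f.seek(-8, 2)         # 8 positions before the end of the file
    tail = f.read()       # the last line

print(name.decode('ascii'), age.decode('ascii'))
print(again.decode('ascii'), repr(tail.decode('ascii')))
```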
This is enough for the moment, and we will return to the method seek again in Subsection 13.28 in the context of writing to files.
13.24 Using tell while iterating through a file
The method tell is disabled when the file is used as an iterator (parsed using a for loop,
using the method __next__, or using the built-in function next). Here is an example where
tell behaves in an unexpected way because of this:
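The sketch below shows what this looks like in standard Python 3 (CPython), where calling tell inside the for loop raises an OSError instead of returning a position. Other Python versions or environments may behave differently; the sample file and the flag tell_worked are assumptions.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'ages.txt')
with open(path, 'w') as f:
    f.write('Nathan 12\nAlice 9\n')

with open(path, 'r') as f:
    for line in f:
        try:
            position = f.tell()
            tell_worked = True
        except OSError as e:
            # CPython reports that telling was disabled by the iteration.
            print('tell failed:', e)
            tell_worked = False
        break    # one line is enough for this demonstration

print(tell_worked)
```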
The reason is that a read-ahead buffer is used to increase efficiency. As a result,
the file pointer advances in large steps across the file as one iterates over lines.
One may try to circumvent this problem by avoiding the for loop, but tell can remain unusable when readline is called on a file that has already been iterated over:
In short, trying to combine tell with optimized higher-level file operations is pretty much hopeless. If you need to solve a task like this, the easiest way out is to define your own function readline which is less efficient than the built-in version but does not disable tell:
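One possible way to write such a function is sketched below. It reads the file one character at a time with read(1), which is slow but keeps tell usable. The body of the function, the sample file and the list positions are assumptions made for this illustration.

```python
import os
import tempfile

def readline(f):
    """Read one line from f character by character; tell keeps working."""
    chars = []
    while True:
        c = f.read(1)
        if c == '':          # end of file
            break
        chars.append(c)
        if c == '\n':        # end of line
            break
    return ''.join(chars)

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'ages.txt')
with open(path, 'w', newline='\n') as f:
    f.write('Nathan 12\nAlice 9\n')

positions = []
with open(path, 'r') as f:
    while True:
        line = readline(f)
        if not line:
            break
        positions.append(f.tell())   # pointer position after each line
print(positions)
```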
13.25 Another sample task for tell
Imagine that your next task is to look for a particular word (name, number, ...) in the file, and return the list of all lines where the word is present. But the result should not be a list of text strings – instead, it should be a list of pointer positions which correspond to the beginning of each line. To be concrete, let’s say that in the file "ages.txt" we must find the beginnings of all lines which contain the character ’1’.
The best way to solve this task is to remember the position of the pointer before reading each line. But as you know from the previous subsection, this will be tricky because you will not be able to use the built-in method readline. Again, the best solution is to use your own function readline (which was defined in the previous subsection). Here is the main program:
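A sketch of such a program follows. To keep it self-contained, it repeats one possible definition of the custom function readline; this definition, the three-line sample file and the list positions are assumptions.

```python
import os
import tempfile

def readline(f):
    """Read one line from f character by character; tell keeps working."""
    chars = []
    while True:
        c = f.read(1)
        if c == '':
            break
        chars.append(c)
        if c == '\n':
            break
    return ''.join(chars)

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'ages.txt')
with open(path, 'w', newline='\n') as f:
    f.write('Nathan 12\nAlice 9\nZoe 11\n')

positions = []
with open(path, 'r') as f:
    while True:
        start = f.tell()           # remember where the line begins
        line = readline(f)
        if not line:               # end of file
            break
        if '1' in line:
            positions.append(start)
print(positions)
```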
13.26 Writing text to a file – method write
In Subsection 13.13 you have seen various modes which can be used to open a file for reading, writing or appending. Mode ’w’ will create a new text file if a file with the given name does not exist. If the file exists, it will be opened and truncated (file pointer set to the beginning).
When writing to a file, it matters where the Python source file is located, because the file will be created in the current directory (this is the same as when opening a file for reading – see Subsection 13.5). Of course one can use an absolute path but in our case this is not needed.
Let’s replace the code in the file "test.py" in folder "sandbox" with
After running the program, a new file named "greeting.txt" will appear in that folder:
Note that the write method does not add the newline character \n at the end of the text
string. The code
will result in a one-line text file:
If we want to have multiple lines, we have to insert newline characters manually:
The result is now what one would expect:
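Both variants can be compared in a single sketch. The greeting texts, the temporary folder and the variable names are assumptions; the file is read back only to show what was written.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'greeting.txt')

# Without newline characters everything ends up on one line.
with open(path, 'w') as f:
    f.write('Hello!')
    f.write('How are you?')
with open(path, 'r') as f:
    one_line = f.read()

# With newline characters added manually, each text is on its own line.
with open(path, 'w') as f:
    f.write('Hello!\n')
    f.write('How are you?\n')
with open(path, 'r') as f:
    two_lines = f.read()

print(repr(one_line))
print(repr(two_lines))
```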
13.27 Writing a list of lines to a file – method writelines
Staying with the previous example, the three lines can be written into the file elegantly using
a list and the method writelines. This method writes a list of text strings into the given
file:
A bit disappointingly, newline characters are still not added at the end of the lines:
But after adding the newline characters manually,
one obtains the desired result:
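The same comparison for writelines is sketched below, with a hypothetical list of three lines. In the second part the newline characters are added on the fly, which works because writelines accepts any sequence of text strings.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'greeting.txt')
lines = ['Hello!', 'How are you?', 'Goodbye!']

# writelines does not add newline characters.
with open(path, 'w') as f:
    f.writelines(lines)
with open(path, 'r') as f:
    joined = f.read()

# Add the newline character to each line while writing.
with open(path, 'w') as f:
    f.writelines(line + '\n' for line in lines)
with open(path, 'r') as f:
    separated = f.read()

print(repr(joined))
print(repr(separated))
```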
13.28 Writing to the middle of a file
In Subsection 13.23 we used the method seek to read from various parts of the file. But what happens if we want to write to some place in the middle of an open file? Will the new text be inserted or overwrite the text that is already there? We will show what will happen. Let’s begin with introducing a sample file "news.txt":
The name "Marconi" starts at position 51. Let’s open the file in the r+ mode, move the pointer
to this position, and write there Marconi’s first name, "Guglielmo ":
However, when the code is run, the contents of the file "news.txt" starting at position 51 are overwritten with the new text:
So this did not work. As a matter of fact, the text is a sequence of zeros and ones on the disk.
It behaves like a sentence written with a pencil on paper. You can erase and overwrite part of
it, but inserting new text in the middle is not possible. We will have to read the whole file as a
text string, slice it, insert the first name in the middle, and then write the new text string back
to the original file:
Finally, the desired result:
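Both attempts can be sketched as follows. The one-line sample text is an assumption and differs from the file "news.txt" used above, so the position of the name is computed with the string method find instead of being fixed at 51.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'news.txt')
original = 'The radio pioneer Marconi was born in Bologna.\n'
pos = original.find('Marconi')       # position where the name starts

# Attempt 1: writing in the 'r+' mode overwrites the existing text.
with open(path, 'w') as f:
    f.write(original)
with open(path, 'r+') as f:
    f.seek(pos)
    f.write('Guglielmo ')
with open(path, 'r') as f:
    overwritten = f.read()

# Attempt 2: read the whole file, slice the string, write everything back.
with open(path, 'w') as f:
    f.write(original)
with open(path, 'r') as f:
    text = f.read()
text = text[:pos] + 'Guglielmo ' + text[pos:]
with open(path, 'w') as f:
    f.write(text)
with open(path, 'r') as f:
    inserted = f.read()

print(repr(overwritten))
print(repr(inserted))
```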
13.29 Things can go wrong: using exceptions
A number of things can fail when working with files. When reading files:
- The file you want to open may not exist.
- You may not have sufficient privileges to open the file.
- The file may be corrupted.
- You may run out of memory while reading the file.
When writing to files:
- The folder where you want to create the new file may not exist or be read-only.
- You may not have sufficient privileges to write to the folder.
- You may run out of disk space while writing data to a file, perhaps by using an infinite loop by mistake, etc.
The most frequent exceptions related to working with files are:
- FileExistsError is raised when trying to create a file or directory which already exists.
- FileNotFoundError is raised when a file or directory is requested but doesn’t exist.
- IsADirectoryError is raised when a file operation is requested on a directory.
- PermissionError is raised when trying to run an operation without the adequate access rights.
- UnicodeError is raised when a Unicode-related encoding or decoding error occurs.
- OSError is raised when a system function returns a system-related error, including I/O failures such as "file not found" or "disk full".
- MemoryError is raised when an operation runs out of memory.
It always is a good idea to put the code that works with the file into a try branch. (The
try-except statement including its full form try-except-else-finally was discussed
in Section 12.) An example:
Recall from Subsection 12.8 that an optional else block can be added for code to be executed if no exceptions were raised, and a finally block for code to be executed always, no matter what happens in the previous blocks.
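A minimal sketch of the full form is shown below. It tries to open a file which was never created, so the FileNotFoundError branch is executed. The file name and the variable status are hypothetical. Since FileNotFoundError and PermissionError are special cases of OSError, they must be listed before it.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'missing.txt')   # this file is never created

try:
    with open(path, 'r') as f:
        text = f.read()
except FileNotFoundError:
    status = 'file not found'
except PermissionError:
    status = 'permission denied'
except OSError as e:
    # Any other system-related failure.
    status = 'other I/O error: ' + str(e)
else:
    # Executed only if no exception was raised.
    status = 'read ' + str(len(text)) + ' characters'
finally:
    # Executed always.
    print('Done.')

print(status)
```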
13.30 Binary files I – checking byte order
Binary files are suitable for storing binary streams (sequences of 0s and 1s) which do not necessarily represent text – such as images or executable files. Before reading from or writing to a binary file, one needs to check the native byte order of the host platform. It can be either little-endian (the bytes of a multi-byte value are stored starting from the little end = least-significant byte) or big-endian (the bytes are stored starting from the big end = most-significant byte).
Big-endian is the most common format in data networking; fields in the protocols of the Internet protocol suite, such as IPv4, IPv6, TCP, and UDP, are transmitted in big-endian order. For this reason, big-endian byte order is also referred to as network byte order. Little-endian storage is popular for microprocessors, in part due to significant influence on microprocessor designs by Intel Corporation. We will not go into more detail here but you can easily find more information online.
To check the byte order, import the sys module and print sys.byteorder:
Hence in our case, the byte order is little-endian.
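A short sketch of the check follows. The result depends on the machine where the code runs. The variable prefix is an assumption; it holds the matching byte-order symbol of the struct module which is used in the following subsections.

```python
import sys

order = sys.byteorder      # either 'little' or 'big'
print(order)

# The struct module uses '<' for little-endian and '>' for big-endian.
prefix = '<' if order == 'little' else '>'
print(prefix)
```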
13.31 Binary files II – writing unsigned integers
Data written to binary files, even raw binary sequences, can always be represented as unsigned (= non-negative) integers. For instance, it can be 8-bit (1-Byte) segments. Or 2-Byte segments. Or entries consisting of one 2-Byte and one 4-Byte segment. Let’s say that our data entries consist of a 2-Byte id and a 4-Byte value (the latter scenario). To packetize the data into binary format, one needs to import the struct module:
This module performs conversions between Python values and C structs represented as
Python bytes objects. It is described at the Python Standard Library documentation page
https://docs.python.org/3/library/struct.html.
Next, let’s open a binary file "datafile.bin" for writing, packetize a sample data pair id =
345, value = 6789 and write it to the file. The file will be automatically closed at the end
of the with statement:
Here, ’<HI’ is a formatting string where < means little-endian byte order (> would be
used for big-endian), H means a 2-Byte unsigned integer, and I means a 4-Byte
unsigned integer. The following table summarizes the most widely used unsigned data
formats:
Symbol | Python data type | Length in Bytes |
B | unsigned integer | 1 |
H | unsigned integer | 2 |
I | unsigned integer | 4 |
Q | unsigned integer | 8 |
For a complete list of symbols see the above URL.
Writing to and reading from binary files can be simplified using the Pickle module – this will be discussed in Subsection ??.
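The steps of this subsection can be sketched as follows. The file is created in a temporary folder (an assumption) and the variable packed is a hypothetical name for the resulting bytes object, which has 2 + 4 = 6 Bytes.

```python
import os
import struct
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'datafile.bin')

# '<' = little-endian, 'H' = 2-Byte unsigned, 'I' = 4-Byte unsigned integer.
packed = struct.pack('<HI', 345, 6789)

# The file is closed automatically at the end of the with statement.
with open(path, 'wb') as f:
    f.write(packed)

print(packed)
print(len(packed), os.path.getsize(path))
```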
13.32 Binary files III – writing bit sequences
In the previous subsection we saw how to write to binary files unsigned integers of various
lengths (8, 16, 32 and 64 bits). Now, let’s say that we want to write 8-bit binary sequences.
These may be parts of a long binary sequence coming from an image, and not represent
numbers at all. But for the sake of writing to a binary file, we will convert them into unsigned
integers. This is easy. For instance, the bit sequence ’11111111’ represents unsigned
decimal integer 255:
The value 2 in int(binseq, 2) stands for base-2 (binary) number. Using the code from the
previous subsection, an 8-bit binary sequence binseq can be written to a binary file
"datafile.bin" as follows:
Here < means little-endian byte order and B stands for 1-Byte (8-bit) unsigned
integer.
As another example, let’s write a 32-bit sequence 11011110110001110110001110011011:
Here I stands for 4-Byte (32-bit) unsigned integer. On the way, the binary sequence was converted into an integer value 3737609115 (which really does not matter).
Finally, the binary sequence may be coming not as a text string but directly as a base-2
integer, such as val=0b11011110110001110110001110011011. In this case, the
conversion step using int is not needed, and one can save the number to the file
directly:
Obviously, a for loop can be used to write as many 8-bit, 16-bit, 32-bit or 64-bit sequences as needed.
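The three cases from this subsection are sketched below. All of them are written to one file for brevity; the temporary folder and the variable names are assumptions.

```python
import os
import struct
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'datafile.bin')

binseq8 = '11111111'
value8 = int(binseq8, 2)          # 8-bit sequence as an unsigned integer

binseq32 = '11011110110001110110001110011011'
value32 = int(binseq32, 2)        # 32-bit sequence as an unsigned integer

val = 0b11011110110001110110001110011011   # already an integer

with open(path, 'wb') as f:
    f.write(struct.pack('<B', value8))     # 1 Byte
    f.write(struct.pack('<I', value32))    # 4 Bytes
    f.write(struct.pack('<I', val))        # 4 Bytes, no conversion needed

with open(path, 'rb') as f:
    data = f.read()
print(value8, value32, len(data))
```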
13.33 Binary files IV – reading byte data
Let’s get back to Subsection 13.31 where we created a binary file "datafile.bin" and wrote two unsigned integers id = 345 (2 Bytes) and value = 6789 (4 Bytes) to it. The two unsigned integers can be retrieved from the file by opening it for binary reading, reading the 2-Byte and 4-Byte byte strings, and converting them into unsigned integers as follows:
Note: We are using the fact that our system is little-endian (see Subsection 13.30). When the
file is open for binary reading using the flag ’rb’, the method read returns the so-called
byte string. Typing read(1) returns a byte string of length 1 Byte, read(2) of length 2 Bytes
etc. The byte string is not a regular text string. One needs to know what the byte string
represents in order to decode it correctly. In our case, we knew that both byte strings
represented unsigned integers, therefore the method from_bytes of class int could be
used to decode it.
Now let’s get back to the example from Subsection 13.32 where we saved the binary sequence
’11011110110001110110001110011011’ to the file "datafile.bin". Here is how to read it
back:
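Both read-back examples can be sketched as follows. To be self-contained, the code first writes the data, and it passes byteorder='little' explicitly to match the ’<’ symbol used when writing. The temporary folder and the variable names are assumptions.

```python
import os
import struct
import tempfile

folder = tempfile.mkdtemp()
path = os.path.join(folder, 'datafile.bin')

# Example 1: the pair id = 345 (2 Bytes) and value = 6789 (4 Bytes).
with open(path, 'wb') as f:
    f.write(struct.pack('<HI', 345, 6789))
with open(path, 'rb') as f:
    id_bytes = f.read(2)          # byte string of length 2
    value_bytes = f.read(4)       # byte string of length 4
data_id = int.from_bytes(id_bytes, byteorder='little')
value = int.from_bytes(value_bytes, byteorder='little')
print(data_id, value)

# Example 2: the 32-bit binary sequence.
binseq = '11011110110001110110001110011011'
with open(path, 'wb') as f:
    f.write(struct.pack('<I', int(binseq, 2)))
with open(path, 'rb') as f:
    number = int.from_bytes(f.read(4), byteorder='little')
restored = format(number, '032b')   # back to a string of 32 bits
print(restored)
```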