Python: List Files in a Directory
I prefer to work with Python because it is a very flexible programming language, and allows me to interact with the operating system easily. This also includes file system functions. To simply list files in a directory the modules os
, subprocess
, fnmatch
, and pathlib
come into play. The following solutions demonstrate how to use these methods effectively.
Using os.walk()
The os
module contains a long list of methods that deal with the filesystem, and the operating system. One of them is walk()
, which generates the filenames in a directory tree by walking the tree either top-down or bottom-up (with top-down being the default setting).
os.walk()
returns a list of three items. It contains the name of the root directory, a list of the names of the subdirectories, and a list of the filenames in the current directory. Listing 1 shows how to write this with only three lines of code. This works with both Python 2 and 3 interpreters.
Listing 1: Traversing the current directory using os.walk()
import os
for root, dirs, files in os.walk("."):
for filename in files:
print(filename)
Using the Command Line via Subprocess
Note: While this is a valid way to list files in a directory, it is not recommended as it introduces the opportunity for command injection attacks.
As already described in the article Parallel Processing in Python, the subprocess
module allows you to execute a system command, and collect its result. The system command we call in this case is the following one:
Example 1: Listing the files in the current directory
$ ls -p . | grep -v /$
The command ls -p .
lists directory files for the current directory, and adds the delimiter /
at the end of the name of each subdirectory, which we’ll need in the next step. The output of this call is piped to the grep
command that filters the data as we need it.
The parameters -v /$
exclude all the names of entries that end with the delimiter /
. Actually, /$
is a Regular Expression that matches all the strings that contain the character /
as the very last character before the end of the string, which is represented by $
.
The subprocess
module allows to build real pipes, and to connect the input and output streams as you do on a command line. Calling the method subprocess.Popen()
opens a corresponding process, and defines the two parameters named stdin and stdout.
Listing 2 shows how to program that. The first variable ls
is defined as a process executing ls -p .
that outputs to a pipe. That’s why the stdout channel is defined as subprocess.PIPE
. The second variable grep
is defined as a process, too, but executes the command grep -v /$
, instead.
To read the output of the ls
command from the pipe, the stdin channel of grep
is defined as ls.stdout
. Finally, the variable endOfPipe
reads the output of grep
from grep.stdout
that is printed to stdout element-wise in the for
-loop below. The output is seen in Example 2.
Listing 2: Defining two processes connected with a pipe
import subprocess
# define the ls command
ls = subprocess.Popen(["ls", "-p", "."],
stdout=subprocess.PIPE,
)
# define the grep command
grep = subprocess.Popen(["grep", "-v", "/$"],
stdin=ls.stdout,
stdout=subprocess.PIPE,
)
# read from the end of the pipe (stdout)
endOfPipe = grep.stdout
# output the files line by line
for line in endOfPipe:
print (line)
Example 2: Running the program
$ python find-files3.py
find-files2.py
find-files3.py
find-files4.py
...
This solution works quite well with both Python 2 and 3, but can we improve it somehow? Let us have a look at the other variants, then.
Combining os
and fnmatch
As you have seen before the solution using subprocesses is elegant but requires lots of code. Instead, let us combine the methods from the two modules os
, and fnmatch
. This variant works with Python 2 and 3, too.
As the first step, we import the two modules os
, and fnmatch
. Next, we define the directory we would like to list the files using os.listdir()
, as well as the pattern for which files to filter. In a for
loop we iterate over the list of entries stored in the variable listOfFiles
.
Finally, with the help of fnmatch
we filter for the entries we are looking for, and print the matching entries to stdout. Listing 3 contains the Python script, and Example 3 the corresponding output.
Listing 3: Listing files using os and fnmatch module
import os, fnmatch
listOfFiles = os.listdir('.')
pattern = "*.py"
for entry in listOfFiles:
if fnmatch.fnmatch(entry, pattern):
print (entry)
Example 3: The output of Listing 3
$ python2 find-files.py
find-files.py
find-files2.py
find-files3.py
...
Using os.listdir()
and Generators
In simple terms, a generator is a powerful iterator that keeps its state. To learn more about generators, check out one of our previous articles, Python Generators.
The following variant combines the listdir()
method of the os
module with a generator function. The code works with both versions 2 and 3 of Python.
As you may have noted before, the listdir()
method returns the list of entries for the given directory. The method os.path.isfile()
returns True
if the given entry is a file. The yield
operator quits the function but keeps the current state, and returns only the name of the entry detected as a file. This allows us to loop over the generator function (see Listing 4). The output is identical to the one from Example 3.
Listing 4: Combining os.listdir()
and a generator function
import os
def files(path):
for file in os.listdir(path):
if os.path.isfile(os.path.join(path, file)):
yield file
for file in files("."):
print (file)
Use pathlib
The pathlib
module describes itself as a way to “Parse, build, test, and otherwise work on filenames and paths using an object-oriented API instead of low-level string operations”. This sounds cool – let’s do it. Starting with Python 3, the module belongs to the standard distribution.
In Listing 5, we first define the directory. The dot (“.”) defines the current directory. Next, the iterdir()
method returns an iterator that yields the names of all the files. In a for
loop we print the name of the files one after the other.
Listing 5: Reading directory contents with pathlib
import pathlib
# define the path
currentDirectory = pathlib.Path('.')
for currentFile in currentDirectory.iterdir():
print(currentFile)
Again, the output is identical to the one from Example 3.
As an alternative, we can retrieve files by matching their filenames by using something called a glob. This way we can only retrieve the files we want. For example, in the code below we only want to list the Python files in our directory, which we do by specifying “*.py” in the glob.
Listing 6: Using pathlib
with the glob
method
import pathlib
# define the path
currentDirectory = pathlib.Path('.')
# define the pattern
currentPattern = "*.py"
for currentFile in currentDirectory.glob(currentPattern):
print(currentFile)
Using os.scandir()
In Python 3.6, a new method becomes available in the os
module. It is named scandir()
, and significantly simplifies the call to list files in a directory.
Having imported the os
module first, use the getcwd()
method to detect the current working directory, and save this value in the path
variable. Next, scandir()
returns a list of entries for this path, which we test for being a file using the is_file()
method.
Listing 7: Reading directory contents with scandir()
import os
# detect the current working directory
path = os.getcwd()
# read the entries
with os.scandir(path) as listOfEntries:
for entry in listOfEntries:
# print all entries that are files
if entry.is_file():
print(entry.name)
Again, the output of Listing 7 is identical to the one from Example 3.
Conclusion
There is disagreement which version is the best, which is the most elegant, and which is the most “pythonic” one. I like the simplicity of the os.walk()
method as well as the usage of both the fnmatch
and pathlib
modules.
The two versions with the processes/piping and the iterator require a deeper understanding of UNIX processes and Python knowledge, so they may not be best for all programmers due to their added (and unnecessary) complexity.
To find an answer to which version is the quickest one, the timeit
module is quite handy. This module counts the time that has elapsed between two events.
To compare all of our solutions without modifying them, we use a Python functionality: call the Python interpreter with the name of the module, and the appropriate Python code to be executed. To do that for all the Python scripts at once a shell script helps (Listing 8).
Listing 8: Evaluating the execution time using the timeit
module
#! /bin/bash
for filename in *.py; do
echo "$filename:"
cat $filename | python3 -m timeit
echo " "
done
The tests were taken using Python 3.5.3. The result is as follows whereas os.walk()
gives the best result. Running the tests with Python 2 returns different values but does not change the order – os.walk()
is still on top of the list.
Method | Result for 100,000,000 loops |
---|---|
os.walk | 0.0085 usec per loop |
subprocess/pipe | 0.00859 usec per loop |
os.listdir/fnmatch | 0.00912 usec per loop |
os.listdir/generator | 0.00867 usec per loop |
pathlib | 0.00854 usec per loop |
pathlib/glob | 0.00858 usec per loop |
os.scandir | 0.00856 usec per loop |
Acknowledgements
The author would like to thank Gerold Rupprecht for his support, and comments while preparing this article.