Python project

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How do I make a python package?

  • What is the best way to setup a package?

  • Can I include data in my python package?

Objectives
  • Create a python package

  • Make the package easy to install

Parts of a python package

The two main parts of a python package are the executable scripts and the python modules.

The executable scripts are things that you run on the command line are installed by appending them to the$PATH environment variable so they can be run anywhere. Executable scripts are usually short (only tens of lines of code) and call functions from the python modules. They are most useful for automating tasks that you do regularly.

The python modules are python functions and classes that can be imported into other python scripts by appending them to the $PYTHONPATH environment variable. Python modules are usually longer (hundreds of lines of code) and are used to implement the core functionality of your package. They are useful to make new python code as they can be imported into other scripts. Being able to only use parts of your code makes it more flexible and more useful for others.

The template we will be working toward

We have created a template with the basic structure of a python package. This can be used a starting point for your own python packages and has some of the documentations and tests set up so you can see how they work and edit them for your purposes. We will slowly work towards a complete version of this template over the course of the workshop.

Creating a simple python package

To ensure that we understand the basics of the python package structure we will create a simple package the prints “Hello World!” to the screen.

First we shall make a function that prints “Hello World!” to the screen. We will put this in a directory called my_package in a file called my_file. We will also need to create a file called __init__.py in the my_package directory to tell python that this is a python module.

mkdir my_package
cd my_package
touch my_file.py
touch __init__.py

Next we will add the following code to my_file.py:

def hello_world():
    print("Hello World!")

This is our python module set up, we will now create an executable script that calls this function.

We will create a file called hello_world.py in the my_package/scripts directory.

mkdir scripts
cd scripts
touch hello_world.py
touch __init__.py

Next we will add the following code to hello_world.py:

from my_package.my_file import hello_world

def main():
    hello_world()

if __name__ == "__main__":
    main()

This is our executable script set up, we will now make a setup.py it so that we can run it from anywhere.

cd ../../
touch setup.py

Next we will add the following code to setup.py:

'''
Setup for my_package that describes it's contents and how to install it.
'''
from setuptools import setup

# List of dependencies for the package
# >= Can be used to specify a minimum version
# >=,< Can be used to specify a minimum and maximum version e.g. 'numpy>=1.15,<1.20',
dependencies = [
    'argparse>=1.4.0',
    'numpy>=1.15',
    'matplotlib>=2.1.0',
    'astropy>=2.0.2',
]

setup(
    # Name of the repository
    name='my_project',
    # Version of the software (keep this up to date with git tags/releases)
    version=1.0,
    # Short description of the package
    description='An example package to be used as a template',
    # Update this with your github URL
    url='https://github.com/ADACS-Australia/python_project_template',
    # Minimum version of python required
    python_requires='>=3.6',
    # Name of your package, should be the same as the name of the directory
    packages=['my_package', 'my_package.scripts'],
    # Any data files that need to be included with the package (non python files)
    package_data={'my_package':['data/*.csv']},
    # Dependencies for the package
    install_requires=dependencies,
    # Scripts that will be run on the command line
    entry_points={
        'console_scripts': [
            # Make a hello_world command that runs the main function in hello_world.py script located in my_package/scripts
            'hello_world=my_package.scripts.hello_world:main',
        ],
    },
)

This actually includes a bit more than we need for our simple script but we will use it as a template for future packages. To confirm all the files are in the correct place we can run:

tree

Which should output:

.
├── my_package
│   ├── __init__.py
│   ├── my_file.py
│   └── scripts
│       ├── __init__.py
│       └── hello_world.py
└── setup.py

We can now install our package by running:

pip install .

We can now run (from any directory) our script by running:

hello_world

Congratulations you have made your first python package!

Commit your changes!

Turning our initial script into a python package

Now that we have understand the parts of a python package we will turn our initial script into a python package. We will split the main two tasks of out initial script into two functions, one that reads in the data and one that plots it, then create an executable scripts that calls these functions.

First we will create a module file called my_package/data_processing.py that contains the following code:

import pandas as pd
from astropy.coordinates import SkyCoord
import astropy.units as u

def input_data(csv_path):
    # Read the CSV file into a DataFrame
    df = pd.read_csv(csv_path)

    # Add new columns to df
    df['RA (deg)']  = 0
    df['Dec (deg)'] = 0

    # Loop over dataframe and convert RA and Dec to degrees
    for index, row in df.iterrows():
        ra_hms  = row['RA (HMS)']
        dec_hms = row['Dec (DMS)']

        # Create a SkyCoord object and specify the units
        c = SkyCoord(ra=ra_hms, dec=dec_hms, unit=(u.hourangle, u.deg))
        # Access the converted RA and Dec in degrees
        ra_deg  = c.ra.deg
        dec_deg = c.dec.deg

        # Update the DataFrame with the converted values
        df.at[index, 'RA (deg)']  = ra_deg
        df.at[index, 'Dec (deg)'] = dec_deg

    return df

This was as easy as copying the code from our initial script into a function, declaring csv_path as an input and df as an output. It is a good habit to use docstrings to describe your functions, but we will go over that later in the documentation section.

Next we will create another module file called my_package/plotting.py that contains the plotting functions. We create a new module file for plotting as it is a different type of task to reading in the data and it is good practice to separate different tasks into different modules to keep your package organised.

import matplotlib.pyplot as plt
from numpy import radians

def molleweide_plot(df):
    # Plot the pulsars
    fig = plt.figure(figsize=(6, 4))
    # Molleweide projection gives us a nice view of the whole sky
    ax = plt.axes(projection='mollweide')
    plt.grid(True, color='gray', lw=0.5, linestyle='dotted')
    ax.set_xticklabels(['22h', '20h', '18h', '16h', '14h','12h','10h', '8h', '6h', '4h', '2h'])

    # Convert RA and Dec to radians and plot
    ax.scatter(radians(-df['RA (deg)'] + 180), radians(df['Dec (deg)']), s=0.2)
    plt.savefig("pulsar_plot.png", dpi=300, bbox_inches='tight')

We can now create a new script called my_package/scripts/filter_and_plot.py that will call these functions:

from my_package.data_processing import input_data
from my_package.plotting import molleweide_plot


def main():
    print("Reading in the data")
    df = input_data("pulsars.csv")

    print("Plotting the data")
    molleweide_plot(df)


if __name__ == "__main__":
    main()

Finally, to be able to install this as an executable script we must add the following to the setup.py file:

setup(
    #...
    entry_points={
        'console_scripts': [
            'hello_world=my_package.scripts.hello_world:main',
            # Above is test script and below is new script that must be added
            'filter_and_plot=my_package.scripts.filter_and_plot:main',
        ],
    },
    #...
)

We can now install our package by running:

pip install .

And run our script by running:

filter_and_plot

Huzzah! We now have an installable script that we can run from any directory and that we can share with others. Don’t forget to commit your changes BUT, we are not quite done yet.

Using an argument parser

We can improve our script by adding an argument parser. This will allow us to specify the path to the CSV file as an argument when we run the script. We will use the argparse module to do this.

First we will add the following to the my_package/scripts/filter_and_plot.py script:

import argparse

from my_package.data_processing import input_data #, filter_by_name, filter_by_declination
from my_package.plotting import molleweide_plot


def main():
    # Create an argument parser so users can understand and use command line options
    parser = argparse.ArgumentParser(description='Process and filter an input CSV file and create a sky plot of the sources.')

    # Add arguments to the parser
    parser.add_argument(
        # A short version of the argument which is normally a single dash and a single letter
        '-i',
        # A long version of the argument which is normally two dashes and a word
        '--input',
        # The type of the argument that will be converted when it is read in
        type=str,
        # A description of what the argument does
        help='The path to the input csv file. Default: "pulsars.csv"',
        # The default value of the argument if none is provided
        default="pulsars.csv",
    )

    # Parse the arguments
    args = parser.parse_args()

    # Read in the data
    df = input_data(args.input)

    print("Plotting the data")
    molleweide_plot(df)

if __name__ == "__main__":
    main()

Now we can run our script with the -h or --help flag to see the help message:

filter_and_plot -h
usage: filter_and_plot [-h] [-i INPUT]

Process and filter an input CSV file and create a sky plot of the sources.

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        The path to the input csv file. Default: "pulsars.csv"

So if you ever forgot what the script does or what it’s options are, you can now rely on the help message to remind you. As we add more features, we can add more arguments to the parser to make our script more flexible.

So for example if the CSV file is in a different directory we can now run:

filter_and_plot -i /path/to/my/csv/file.csv

Adding data files

That’s all pretty cool but what if we want to add some data files to our package so we don’t have to download them and point to them with environment variables? We can do this by adding a package_data argument to the setup.py file:

setup(
    #...
    package_data={
        'my_package': ['data/*.csv'],
    },
    #...
)

This will include all csv files in the my_package/data directory in the package when we run pip install .. Remember to make this directory and move the pulsars.csv file into it. One easy what to then access the files is to create a a module called my_package/load_data.py that contains the following:

import os

# Path to the pulsar CSV data file relative to the current file location
PULSAR_CSV_PATH = os.path.join(os.path.dirname(__file__), 'data/pulsars.csv')

We can then import this module and use the PULSAR_CSV_PATH variable to access the data file. The standard practice is to name this variable in capitals to indicate that it is a constant used outside of function scopes (won’t change and is not used as an input to a function). So for example we can set it as the default value for the input argument in the my_package/scripts/filter_and_plot.py script:

import argparse

from my_package.data_processing import input_data #, filter_by_name, filter_by_declination
from my_package.plotting import molleweide_plot
from my_package.load_data import PULSAR_CSV_PATH


def main():
    # Create an argument parser so users can understand and use command line options
    parser = argparse.ArgumentParser(description='Process and filter an input CSV file and create a sky plot of the sources.')

    # Add arguments to the parser
    parser.add_argument(
        # A short version of the argument which is normally a single dash and a single letter
        '-i',
        # A long version of the argument which is normally two dashes and a word
        '--input',
        # The type of the argument that will be converted when it is read in
        type=str,
        # A description of what the argument does
        help='The path to the input csv file. If none is provided the default pulsar data will be used.',
        # The default value of the argument if none is provided
        default=PULSAR_CSV_PATH,
    )
    #...

So now we can run the script without any arguments and it will use the default data file

Key Points

  • What an executable script is

  • What a python package and module is

  • How to include data in your package