Lessons learned building with Django, Celery, and Pytest
As someone who writes ruby professionally, I recently learned python to build a bot which buys an index of crypto using binance.
The best thing about ruby is Rails, so I wanted an excuse to try out Django and see how it compared. Adding multi-user mode to the crypto bot felt like a good enough excuse. My goals were to:
- Add a model for the user that persisted to a database
- Add a cron job to kick off a job for each user, preferably using a job management library
- Add some tests for primary application flows
- Set up docker-compose for the DB and app admin
I’ll detail learnings around Docker in a separate post. In this post, I walk through my raw notes as I dug into the django + python ecosystem further.
(I’ve written some other learning logs in this style if you are interested)
Open Source Django Projects
I found a bunch of mature, open-source django projects that were very helpful to grep (or, ripgrep) through. Clone these into a `~/Projects/django` folder so you can easily search through them locally when learning:
- https://github.com/getsentry/sentry
- https://github.com/arrobalytics/django-ledger
- https://github.com/intelowlproject/IntelOwl
- https://github.com/mdn/kuma – manages the MDN docs
- https://github.com/apache/airflow
- https://github.com/kiwicom/kiwi-structlog-config – Advanced structlog configuration examples.
More python language learnings
I learned a bunch more about the core python language. I was using the most recent (3.9) version of python at the time.
- You can set up imports in `__init__.py` to make it more convenient for users to import from your package.
- As of python3, you don’t need an `__init__.py` within a folder to make it importable.
- You can import multiple objects in a single statement: `from sentry.db.models import (one, two, three)`
- iPython can be set up to automatically reload modified code.
- Somehow VS Code’s `python.terminal.activateEnvironment` got enabled again. This does not seem to play well with poetry’s venv. I disabled it, and that eliminated some weird environment issues I was running into.
- When using poetry, if you specify a dependency with `path` in your `toml`, even if it’s in the dev section, it is still referenced and validated when running `poetry install`. This can cause issues when building dockerfiles for production that still reference local copies of a package you are modifying.
- It doesn’t seem like there is a way to force a non-nil value in mypy. If you are getting typing errors due to nil values, `assert var is not None` or `typing.cast` are the best options I found.
- Inline return with a condition is possible: `if not array_of_dicts: return None`
- There doesn’t seem to be a one-command way to install pristine packages. `poetry env remove python && poetry env use python && poetry install` looks like the best approach. I ran into this when I switched a package to reference a github branch; the package was already installed and poetry wouldn’t reinstall it from the github repo.
- You can copy/paste functions into a REPL with iPython, but without iPython it’s very hard to copy/paste multiline chunks of code. This is a good reason to install iPython in your production deployment: it makes REPL debugging in production much easier.
- By default, all arguments can be either keyword or positional. However, you can define certain parameters to be positional-only using a `/` in the function definition (see the sketch after this list).
- Variable names cannot start with numbers. This may seem obvious, but when you are switching from using dicts to `TypedDict` you may have keys which start with a number that will only cause issues when you start to construct `TypedDict` instances.
- There is not a clean way to update TypedDicts. It looks like the easiest way is to create a brand new one or type-cast a raw updated dict.
- Cast a union list of types to a specific type with `typing.cast`.
- Convert a string to an enum via `EnumClassName('input_string')` as long as your enum has `str` as one of its subclasses.
- Disable typing for a specific line with `# type: ignore` as an inline comment.
- Memoize a function by declaring a variable as global and setting a default value for it within the python file the function is in. There is also `@functools.cache` in the stdlib, which should work in most situations.
- mypy is a popular type checker, but there’s also pyright, which is installed by default with pylance (VS Code’s python extension).
- pylint seems like the best linter, although I was surprised at how many different options there were. This answer helped me get it working with VS Code.
- Magic methods (i.e. `__xyz__`) are also called dunder methods.
- A ‘sentinel value’ is used to distinguish between an intentional `None` value and a value that indicates a failure, cache miss, no object found, etc. Think `undefined` vs `null` in Javascript. This was the first time I’d heard the pattern described this way.
- The `yield` keyword is interesting. It returns the value provided, but the state of the function is maintained and wrapped in a returned iterator. Each subsequent `next` call returns the value of the next `yield` in the logic.
- Unlike ruby, it does not seem possible to add functions to the global namespace. This is a nice feature: fewer instances of ‘where is this method coming from?’
- Black code formatting is really good. I thought I wouldn’t like it, but I was wrong. The cognitive load it takes off your mind while writing code is more than I would have expected.
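Here’s a quick sketch pulling a few of these together (positional-only parameters, a sentinel value, `@functools.cache`, `yield`, and string-to-enum conversion; all names are illustrative, not from any real codebase):

```python
from enum import Enum
from functools import cache


# everything before the `/` can only be passed positionally
def price(amount, /, currency="USD"):
    return f"{amount} {currency}"


# a sentinel distinguishes "argument not passed" from an explicit None
MISSING = object()


def fetch(key, default=MISSING):
    if default is MISSING:
        raise KeyError(key)
    return default


# stdlib memoization instead of a hand-rolled global variable
@cache
def expensive_calculation(n):
    return n ** 2


# each next() resumes the function at the last yield
def countdown(n):
    while n > 0:
        yield n
        n -= 1


# subclassing str allows Side("buy") to work
class Side(str, Enum):
    BUY = "buy"
    SELL = "sell"


print(price(10))           # => 10 USD
print(list(countdown(3)))  # => [3, 2, 1]
print(Side("buy"))         # => Side.BUY
```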
Structured logging with context & ENV-customized levels
`structlog` is a really powerful package, but the documentation is lacking and it was hard to configure. Similar to my preferred ruby logger, I wanted the ability to:
- Set global logging context
- Easily pass key/value pairs into the logger
- Configure log level through environment variables
Here’s the configuration which worked for me:
```python
# utils.py
import logging

import structlog
from decouple import config
from structlog.threadlocal import wrap_dict


def setLevel(level):
    level = getattr(logging, level.upper())
    structlog.configure(
        # context_class enables thread-local logging to avoid passing a log instance around
        # https://www.structlog.org/en/21.1.0/thread-local.html
        context_class=wrap_dict(dict),
        wrapper_class=structlog.make_filtering_bound_logger(level),
        cache_logger_on_first_use=True,
    )


log_level = config("LOG_LEVEL", default="WARN")
setLevel(log_level)

log = structlog.get_logger()
```
To add context to the logger and log a key-value pair:

```python
from utils import log

log.bind(user_id=user.id)
log.info("something", amount=amount)
```
Django
- `poetry add django` to your existing project to get started.
- Then, `poetry shell` and run `django-admin startproject thename` to set up the project.
- Django has an interesting set of bundled apps, including an activeadmin-like admin UI.
- Swap the DB connection information in `settings.py` to use PG and `poetry add psycopg2`.
- Django will not create the database for you, so you need to run `CREATE DATABASE <dbname>;` to add it before running your migrations.
- The default configuration does not pull from your ENV variables. I’ve written a section below about application configuration; it was tricky for me coming from rails.
- `django-extensions` is a popular package that includes a bunch of functionality missing from the core django project. Some highlights: shell_plus, reset_db, sqlcreate.
- It doesn’t look like there are any generators, unlike rails or phoenix.
- Asset management is not included. There’s a host of options you can pick from.
- There’s a full-featured ORM with adaptors to multiple DBs. Here are some tips and tricks (see the sketch after this list):
  - There’s a native `JSONField` type which is compatible with multiple databases. It uses `jsonb` under the hood when postgres is in place.
  - After you’ve defined a model, you autogen the migration code with `python manage.py makemigrations` and then run the migrations with `python manage.py migrate`.
  - To get everything: `User.objects.all()`, or `User.objects.iterator()` to page through them.
  - Getting a single object: `User.objects.get(id=1)`.
  - Use `save()` on an object to update or create it.
  - Create an object in a single line using `User.objects.create(kwargs)`.
- You need a project (global config) and apps (the actual code that makes up the core of your application).
- It looks like django apps (`INSTALLED_APPS`) are sort of like rails engines, but much more lightweight.
- Apps can each have their own migrations, and they are not stored in a global folder. For instance, the built-in auth application has a bunch of migrations that will run but are not included in your application source code. I found this confusing.
- Table names are namespaced based on which app the model is in. If you have a `user` model in a `users` app, the table will be named `users_user`.
- It looks like there is a unicorn equivalent, `gunicorn`, that is the preferred way of running web workers. It’s not included or configured by default.
- Flask is a framework similar to sinatra: a simple routing and rendering web framework.
- The app scaffolding is very lightweight. Views, models, tests, and the admin UI have a standard location. Everything else is up to the user.
- There’s a caching system built into django, but it doesn’t support redis by default. I already have redis in place, so I didn’t want to use the default adapter (memcache). There’s a package, `django-redis`, that adds redis support to django’s cache.
- `django-extensions` has a nifty `SHELL_PLUS_PRE_IMPORTS = [("decimal", "Decimal")]` setting that will auto-import additional packages for you. It’s annoying to have to `import` various objects just to poke around in the REPL, and this setting eliminates that friction.
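Here’s a minimal sketch tying those ORM tips together (the `User` model and its fields are hypothetical):

```python
# users/models.py
from django.db import models


class User(models.Model):
    name = models.CharField(max_length=100)
    preferences = models.JSONField(default=dict)  # jsonb on postgres


# after `python manage.py makemigrations` and `python manage.py migrate`:
user = User.objects.create(name="Ada")  # create in a single line
user.name = "Ada Lovelace"
user.save()                             # save() persists the update

everyone = User.objects.all()
one = User.objects.get(id=user.id)
for u in User.objects.iterator():       # pages through results lazily
    print(u.name)
```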
Use decimal objects for floats when decoding JSON
In my case, I needed to use Decimals instead of floats everywhere to avoid floating point arithmetic inaccuracies. Even $0.01 difference could cause issues when submitting orders to the crypto exchange.
This is really easy when parsing JSON directly:

```python
requests.get(endpoint).json(parse_float=decimal.Decimal)
```
If you are using a `JSONField` to store float values, it gets more complicated. You can’t just pass `parse_float` to the `JSONField` constructor. A custom decoder must be created:
```python
import json
from decimal import Decimal

from django.db import models


class CustomJSONDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        # parse every float in the payload as a Decimal
        kwargs["parse_float"] = Decimal
        super().__init__(*args, **kwargs)


class YourModel(models.Model):
    the_field = models.JSONField(default=dict, decoder=CustomJSONDecoder)
```
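Reusing the imports from the snippet above, you can sanity-check the decoder directly with `json.loads`; every float in the payload comes back as a `Decimal`:

```python
payload = json.loads('{"price": 0.01}', cls=CustomJSONDecoder)
assert payload["price"] == Decimal("0.01")
```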
Multiple django environments
There is not a standard way of managing different environments (staging, development, test, prod) in django. I found this very confusing and wasted time attempting to figure out what the best practice was here.
Here are some tips & recommendations:
- Django doesn’t come with the ability to parse database URLs. There’s an extension, `dj_database_url`, for this.
- Poetry has a built-in dev category, which can be used for packages only required for development and testing. There are no separate test or development groups.
- python-dotenv seems like the best package for loading a `.env` file into `os.environ`. However, if you are building an application with multiple entrypoints (i.e. web, cli, repl, worker, etc.) this gets tricky, since you need to ensure `load_dotenv()` is called before any code which looks at `os.environ`.
- After attempting to get `python-dotenv` working for me, I gave `decouple` a shot. It’s much better: you use its `config` function to extract variables from the environment. That function ensures that `.env` is loaded before looking at your local `os.environ`. Use this package instead.
- By default, Django does not set up your `settings.py` to pull from the environment. You need to do this manually. I’ve included some snippets below.
- After getting decouple in place, you’ll probably want separate configurations for different environments. The best way to do this is to set `DJANGO_SETTINGS_MODULE` to point to a completely separate configuration file for each environment (see the sketch after this list).
  - In your `toml` you can set `[tool.pytest.ini_options] DJANGO_SETTINGS_MODULE = "app.settings.test"` to force a different environment for testing.
  - In production, you’ll set `DJANGO_SETTINGS_MODULE` to `app.settings.production` in the docker or heroku environment.
  - For all other environments, you’ll set `DJANGO_SETTINGS_MODULE` to `app.settings.development` in your `manage.py`.
  - In each of these files (`app/settings/development.py`, `app/settings/test.py`, etc.) you’ll `from .application import *` and store all common configuration in `app/settings/application.py`.
  - Here’s a working example.
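Here’s a minimal sketch of that layout, using the `app.settings.*` module names from above (the specific settings values are illustrative):

```python
# app/settings/application.py: common config, pulled from the environment
from decouple import config

SECRET_KEY = config("DJANGO_SECRET_KEY")
DEBUG = False

# app/settings/development.py: one small file per environment
from .application import *

DEBUG = True

# manage.py: default to development settings when nothing else is set
import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "app.settings.development")
```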
Here’s how to configure django cache and celery to work with redis:
```python
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config("REDIS_URL"),
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        },
    }
}
```
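The celery half is just a couple of settings pointing at the same redis instance. Note that these `CELERY_*` names assume celery is configured with `namespace="CELERY"`, as in the sketch in the celery section below:

```python
# settings.py: celery picks these up via its django settings namespace
CELERY_BROKER_URL = config("REDIS_URL")
CELERY_RESULT_BACKEND = config("REDIS_URL")
```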
Here’s how to use `dj_database_url` with `decouple`:

```python
DATABASES = {"default": dj_database_url.parse(config("DATABASE_URL"))}
```
Job management using Celery
Django does not come with a job queue. Celery is the most popular job queue library out there and requires a broker (I used redis). It looks like it requires a decent amount of configuration, but I chose to use it anyway to understand how it compared to Sidekiq/Resque/ActiveJob/Oban/etc.
- `poetry add celery --allow-prereleases` (I needed a prerelease to work with the version of `click` I was using).
- If you are using redis as the broker (easier for me, since I already had it installed and running), you’ll need to `poetry add redis`.
- Celery does not use `manage.py`, so it would not load the `.env` file. I needed to manually run `dotenv_load()` at the top of my celery config. I discovered that this needed to be conditionally loaded for prod, at which point I discovered that `decouple` is a much better package for managing configuration.
- I put my celery tasks within the `users` application as `tasks.py`. You can specify a dot-path to the celery config via the CLI: `celery -A users.tasks worker --loglevel=INFO`
- You can configure celery to store results. If you do, you are responsible for clearing out results. They do not expire automatically.
- Celery has a built-in cron scheduler. Very nice! There’s even a nice `-B` option for running the scheduler within a single worker process (not recommended for prod, but nice for development).
- When I tried to access django models, I got some weird errors. There’s a django-specific setup process you need to run through: `DJANGO_SETTINGS_MODULE` needs to be set, just like in `manage.py`, and you can’t import django-specific modules at the top of the celery config file (see the sketch after this list).
- Celery is threaded by default. If your code is not thread safe, you’ll need to set `--concurrency=1`.
- By default, tasks do not run inline. If you want to set up an integration test for your tasks, you need to either (a) run tasks in eager mode (not recommended) or (b) set up a worker thread to run tasks for you during your tests.
- Eager mode is not recommended for testing, since it doesn’t simulate the production environment as closely. However, running a worker thread introduces another set of issues (like database cleanup not working properly).
- There’s no real downside to using `@shared_task` instead of `@app.task`. It’s easier to do this from the start: less refactoring to do when your application grows.
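Here’s a sketch of how the `users/tasks.py` wiring described above could look (module paths and the task body are illustrative, not my exact code):

```python
# users/tasks.py
import os

from celery import Celery, shared_task

# set DJANGO_SETTINGS_MODULE before anything django-specific runs,
# just like manage.py does
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "app.settings.development")

app = Celery("users")
# pull CELERY_* settings (broker, result backend) from django's settings
app.config_from_object("django.conf:settings", namespace="CELERY")


@shared_task
def run_purchase_for_user(user_id):
    # import django models inside the task, not at the top of the file
    from users.models import User

    user = User.objects.get(id=user_id)
    # ... kick off the index purchase for this user
```

Run it with `celery -A users.tasks worker --loglevel=INFO`, adding `-B` in development to run the scheduler in the same process.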
Testing
Some more learnings about working with `pytest` & `vcr` in combination with `django`:
- Database cleaning is done automatically for you via `@pytest.mark.django_db` at the top of your test class. This is great: no need to pull in a separate database cleaner.
- To be able to run `pytest` tests which rely on django models/configuration, you need the `pytest-django` extension.
- You can stick any config that would go in `pytest.ini` in your toml file under `[tool.pytest.ini_options]`.
- You need to set up a separate config for your database to ensure it doesn’t use the same one as your development environment. The easiest way to do this is to add `DJANGO_SETTINGS_MODULE = "yourapp.settings.test"` to your `toml` file and then override the database setup in the `yourapp/settings/test.py` file.
- You can use pytest fixtures to implement ruby-style around functions (see the sketch after this list).
- Redis/django cache is not cleared automatically between test runs. You can do this manually via `django.core.cache.cache.clear()`.
- In a scenario where you memoize/cache a global function that isn’t tied to a class, you may need to clear the cache to avoid global state causing indeterminate test results. You can do this for a single function via its `cache_clear()` method, or identify all functions with an lru cache and clear each of them.
- Django has a test runner (`python manage.py test`). It seems very different (it doesn’t support fixtures), and I ran into strange compatibility issues when using it. Use `pytest` instead.
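Here’s a sketch of these pieces together, assuming `pytest-django` is installed and reusing the hypothetical `User` model from the ORM sketch above:

```python
import pytest
from django.core.cache import cache

from users.models import User


# an autouse fixture gives you ruby-style "around" behavior: everything
# before the yield runs before each test, everything after runs after it
@pytest.fixture(autouse=True)
def clear_cache_between_tests():
    cache.clear()
    yield
    cache.clear()


@pytest.mark.django_db
def test_create_user():
    user = User.objects.create(name="Ada")
    assert User.objects.get(id=user.id) == user
```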
My thoughts on Django
I continue to be impressed with the python ecosystem. The dev tooling (linting, repls, type checking, formatting, etc.) is robust, and there are reasonably well-written and maintained packages for everything I needed. Most packages seem better maintained than their ruby equivalents. Only once did I have to dive into a package and hack in a change I needed. That’s pretty impressive, especially since the complexity of this application grew a lot more than I expected.
Working with python is just fun and fast (two things that are very important to me!). It’s a similar level of fun to ruby, but the language is better designed and therefore easier to read. You can tell the ecosystem has more throughput: more developers use the various packages, so more configuration options exist and more bugs have been worked out. This increases dev velocity, which matters a ton for a small side project and even more for a small startup. I don’t see a reason why I’d use ruby if I’m not building a rails-style web application.
Rails is ruby’s killer app. It’s better than Django across several dimensions:
- Better defaults.
- Multiple environments supported out of the box.
- Expansive batteries-included components. Job queuing, asset management, web workers, incoming/outgoing email processing, etc. This is the biggest gap in my mind: it takes a lot more effort & decisions to get all of these components working.
- Since django takes a ‘bring your own application components’ approach, you don’t get the benefit of large companies like Shopify, GitHub, etc. using these components and working out all of the bugs for you.
The Django way seems to be a very slim feature set that can be easily augmented with additional packages. Generally, I like unix-style single-responsibility tooling, but in my experience the integration and maintenance cost of adding tens of packages is very high. I want my web framework to do a lot for me. Yes, I’m biased since I’m used to rails, but I do think this approach is just better for rapid application development.
This was a super fun project. Definitely learned to love python and appreciate the Django ecosystem.
What I’m missing
There were some things I missed from other languages, although the list is pretty short and nitpicky:
- Source code references within docs. I love this about the ruby/elixir documentation: as you are looking at the docs for a method, you can reveal the source code for that method. It was painful to (a) jump into an ipython session, (b) import the module, and (c) run `?? module.reference` to view the source code.
- Package documentation in Dash
- More & better defaults in django setup.
- Improved stdlib map-reduce. If you can’t fit your data transformation into a comprehension, it’s painful to write and read. You end up writing for loops and appending to arrays.
- Format code references in the `path/to/file.py:line:col` format for easy click-to-open support in various editors. This drove me nuts when debugging stack traces.
- Improved TypedDict support. It seems this is a relatively new feature, and it shows. TypedDicts are frustrating to work with.
Open Questions
I hope to find an excuse to dig a bit more into the python ecosystem, specifically to learn the ML side of things. Here are some questions I still had at the end of the project:
- Does numpy/pandas eliminate data manipulation pain? My biggest gripe with python is the lack of chained data manipulation operators like ruby/elixir.
- How does the ML/AI/data science stuff work? This was one of my primary motivations for brushing up on my python skills and I’d love to deeply explore this.
- How does async/await work in python?
- How does asset management / frontend work in django?
Debugging asdf plugin issues
Although unrelated to this post, I had to debug some issues with an asdf plugin. Here’s how to do this:
- Clone the asdf plugin repo locally: `git clone https://github.com/asdf-community/asdf-poetry ~/Projects/asdf-poetry`
- Remove the existing version of the plugin: `cd ~/.asdf/plugins && rm -rf poetry`
- Symlink the repo you cloned: `ln -s ~/Projects/asdf-poetry poetry`
Now all commands hitting the poetry plugin will use your custom local copy.