Lessons learned building with Django, Celery, and Pytest
As someone who writes ruby professionally, I recently learned python to build a bot which buys an index of crypto using binance.
The best thing about ruby is Rails, so I wanted an excuse to try out Django and see how it compared. Adding multi-user mode to the crypto bot felt like a good enough excuse. My goals were to:
- Add a model for the user that persisted to a database
- Add a cron job to kick off a job for each user, preferably using a job management library
- Add some tests for primary application flows
- Set up docker-compose for the DB and app admin
I’ll detail learnings around Docker in a separate post. In this post, I walk through my raw notes as I dug into the django + python ecosystem further.
(I’ve written some other learning logs in this style if you are interested)
Open Source Django Projects
I found a bunch of mature, open-source django projects that were very helpful to grep (or, ripgrep) through. Clone these into a `~/Projects/django` folder so you can easily search through them locally when learning:
- https://github.com/getsentry/sentry
- https://github.com/arrobalytics/django-ledger
- https://github.com/intelowlproject/IntelOwl
- https://github.com/mdn/kuma – manages the MDN docs
- https://github.com/apache/airflow
- https://github.com/kiwicom/kiwi-structlog-config – Advanced structlog configuration examples.
More python language learnings
I learned a bunch more about the core python language. I was using the most recent (3.9) version of python at the time.
- You can set up imports in `__init__.py` to make it more convenient for users to import from your package.
- As of python3, you don’t need an `__init__.py` within a folder to make it importable.
- You can import multiple objects in a single statement: `from sentry.db.models import (one, two, three)`
- iPython can be set up to automatically reload modified code.
- Somehow VS Code’s `python.terminal.activateEnvironment` got enabled again. This does not seem to play well with poetry’s venv. I disabled it, and that eliminated some weird environment issues I was running into.
- When using poetry, if you specify a dependency with `path` in your `toml`, even if it’s in the dev section, it is still referenced and validated when running `poetry install`. This can cause issues when building dockerfiles for production that still reference local copies of a package you are modifying.
- It doesn’t seem like there is a way to force a non-nil value in mypy. If you are getting typing errors due to nil values, `assert var is not None` or `typing.cast` are the best options I found.
- Inline return with a condition is possible: `if not array_of_dicts: return None`
- There doesn’t seem to be a one-command way to install pristine packages. `poetry env remove python && poetry env use python && poetry install` looks like the best approach. I ran into this when I switched a package to reference a github branch; the package was already installed and poetry wouldn’t reinstall it from the github repo.
- You can copy/paste functions into a REPL with iPython, but without iPython it’s very hard to copy/paste multiline chunks of code. This is a good reason to install iPython in your production deployment: it makes REPL debugging in production much easier.
- By default, all arguments can be either keyword or positional. However, you can define certain parameters to be positional-only using a `/` in the function definition (see the sketch after this list).
- Variable names cannot start with numbers. This may seem obvious, but when you are switching from using dicts to `TypedDict` you may have keys which start with a number that will only cause issues when you start to construct `TypedDict` instances.
- There is not a clean way to update TypedDicts. It looks like the easiest way is to create a brand new one or type-cast a raw updated dict.
- Cast a union list of types to a specific type with `typing.cast`.
- Convert a string to an enum via `EnumClassName('input_string')` as long as your enum has `str` as one of its subclasses.
- Disable typing for a specific line with `# type: ignore` as an inline comment.
- Memoize a function by declaring a variable as global and setting a default value for it within the python file the function is in. There is also `@functools.cache` in the stdlib, which should work in most situations.
- mypy is a popular type checker, but there’s also pyright, which is installed by default with pylance (VS Code’s python extension).
- pylint seems like the best linter, although I was surprised at how many different options there were. This answer helped me get it working with VS Code.
- Magic methods (i.e. `__xyz__`) are also called dunder methods.
- A ‘sentinel value’ is used to distinguish between an intentional `None` value and a value that indicates a failure, cache miss, no object found, etc. Think `undefined` vs `null` in Javascript. This was the first time I’d heard the pattern described this way.
- The `yield` keyword is interesting. It returns the value provided, but the state of the function is maintained and wrapped in a returned iterator. Each subsequent `next` call returns the value of the next `yield` in the logic.
- Unlike ruby, it does not seem possible to add functions to the global namespace. This is a nice feature: fewer instances of ‘where is this method coming from?’
- Black code formatting is really good. I thought I wouldn’t like it, but I was wrong. The cognitive load it takes off your mind while writing code is more than I would have expected.
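Here’s a quick sketch pulling a few of these together (positional-only parameters, a sentinel value, `@functools.cache`, `yield`, and string-to-enum conversion; all names are illustrative, not from any real codebase):

```python
from enum import Enum
from functools import cache


# everything before the `/` can only be passed positionally
def price(amount, /, currency="USD"):
    return f"{amount} {currency}"


# a sentinel distinguishes "argument not passed" from an explicit None
MISSING = object()


def fetch(key, default=MISSING):
    if default is MISSING:
        raise KeyError(key)
    return default


# stdlib memoization instead of a hand-rolled global variable
@cache
def expensive_calculation(n):
    return n ** 2


# each next() resumes the function at the last yield
def countdown(n):
    while n > 0:
        yield n
        n -= 1


# subclassing str allows Side("buy") to work
class Side(str, Enum):
    BUY = "buy"
    SELL = "sell"


print(price(10))           # => 10 USD
print(list(countdown(3)))  # => [3, 2, 1]
print(Side("buy"))         # => Side.BUY
```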
Structured logging with context & ENV-customized levels
`structlog` is a really powerful package, but the documentation is lacking and it was hard to configure. Similar to my preferred ruby logger, I wanted the ability to:
- Set global logging context
- Easily pass key/value pairs into the logger
- Configure log level through environment variables
Here’s the configuration which worked for me:
```python
# utils.py
import logging

import structlog
from decouple import config
from structlog.threadlocal import wrap_dict


def setLevel(level):
    level = getattr(logging, level.upper())
    structlog.configure(
        # context_class enables thread-local logging to avoid passing a log instance around
        # https://www.structlog.org/en/21.1.0/thread-local.html
        context_class=wrap_dict(dict),
        wrapper_class=structlog.make_filtering_bound_logger(level),
        cache_logger_on_first_use=True,
    )


log_level = config("LOG_LEVEL", default="WARN")
setLevel(log_level)

log = structlog.get_logger()
```
To add context to the logger and log a key-value pair:

```python
from utils import log

log.bind(user_id=user.id)
log.info("something", amount=amount)
```
Django
- `poetry add django` to your existing project to get started.
- Then, `poetry shell` and run `django-admin startproject thename` to set up the project.
- Django has an interesting set of bundled apps, including an activeadmin-like admin UI.
- Swap the DB connection information in `settings.py` to use PG and `poetry add psycopg2`.
- Django will not create the database for you, so you need to run `CREATE DATABASE <dbname>;` to add it before running your migrations.
- The default configuration does not pull from your ENV variables. I’ve written a section below about application configuration; it was tricky for me coming from rails.
- `django-extensions` is a popular package that includes a bunch of functionality missing from the core django project. Some highlights: shell_plus, reset_db, sqlcreate.
- It doesn’t look like there are any generators, unlike rails or phoenix.
- Asset management is not included. There’s a host of options you can pick from.
- There’s a full-featured ORM with adaptors to multiple DBs. Here are some tips and tricks (see the sketch after this list):
  - There’s a native `JSONField` type which is compatible with multiple databases. It uses `jsonb` under the hood when postgres is in place.
  - After you’ve defined a model, you autogen the migration code with `python manage.py makemigrations` and then run the migrations with `python manage.py migrate`.
  - To get everything: `User.objects.all()`, or `User.objects.iterator()` to page through them.
  - Getting a single object: `User.objects.get(id=1)`.
  - Use `save()` on an object to update or create it.
  - Create an object in a single line using `User.objects.create(kwargs)`.
- You need a project (global config) and apps (the actual code that makes up the core of your application).
- It looks like django apps (`INSTALLED_APPS`) are sort of like rails engines, but much more lightweight.
- Apps can each have their own migrations, and they are not stored in a global folder. For instance, the built-in auth application has a bunch of migrations that will run but are not included in your application source code. I found this confusing.
- Table names are namespaced based on which app the model is in. If you have a `user` model in a `users` app, the table will be named `users_user`.
- It looks like there is a unicorn equivalent, `gunicorn`, that is the preferred way of running web workers. It’s not included or configured by default.
- Flask is a framework similar to sinatra: a simple routing and rendering web framework.
- The app scaffolding is very lightweight. Views, models, tests, and the admin UI have a standard location. Everything else is up to the user.
- There’s a caching system built into django, but it doesn’t support redis by default. I already have redis in place, so I didn’t want to use the default adapter (memcache). There’s a package, `django-redis`, that adds redis support to django’s cache.
- `django-extensions` has a nifty `SHELL_PLUS_PRE_IMPORTS = [("decimal", "Decimal")]` setting that will auto-import additional packages for you. It’s annoying to have to `import` various objects just to poke around in the REPL, and this setting eliminates that friction.
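Here’s a minimal sketch tying those ORM tips together (the `User` model and its fields are hypothetical):

```python
# users/models.py
from django.db import models


class User(models.Model):
    name = models.CharField(max_length=100)
    preferences = models.JSONField(default=dict)  # jsonb on postgres


# after `python manage.py makemigrations` and `python manage.py migrate`:
user = User.objects.create(name="Ada")  # create in a single line
user.name = "Ada Lovelace"
user.save()                             # save() persists the update

everyone = User.objects.all()
one = User.objects.get(id=user.id)
for u in User.objects.iterator():       # pages through results lazily
    print(u.name)
```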
Use decimal objects for floats when decoding JSON
In my case, I needed to use Decimals instead of floats everywhere to avoid floating point arithmetic inaccuracies. Even $0.01 difference could cause issues when submitting orders to the crypto exchange.
This is really easy when parsing JSON directly:

```python
requests.get(endpoint).json(parse_float=decimal.Decimal)
```
If you are using a `JSONField` to store float values, it gets more complicated. You can’t just pass `parse_float` to the `JSONField` constructor. A custom decoder must be created:
```python
import json
from decimal import Decimal

from django.db import models


class CustomJSONDecoder(json.JSONDecoder):
    def __init__(self, *args, **kwargs):
        # parse every float in the payload as a Decimal
        kwargs["parse_float"] = Decimal
        super().__init__(*args, **kwargs)


class YourModel(models.Model):
    the_field = models.JSONField(default=dict, decoder=CustomJSONDecoder)
```
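Reusing the imports from the snippet above, you can sanity-check the decoder directly with `json.loads`; every float in the payload comes back as a `Decimal`:

```python
payload = json.loads('{"price": 0.01}', cls=CustomJSONDecoder)
assert payload["price"] == Decimal("0.01")
```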
Multiple django environments
There is not a standard way of managing different environments (staging, development, test, prod) in django. I found this very confusing and wasted time attempting to figure out what the best practice was here.
Here are some tips & recommendations:
- Django doesn’t come with the ability to parse database URLs. There’s an extension, `dj_database_url`, for this.
- Poetry has a built-in dev category, which can be used for packages only required for development and testing. There are no separate test or development groups.
- python-dotenv seems like the best package for loading a `.env` file into `os.environ`. However, if you are building an application with multiple entrypoints (i.e. web, cli, repl, worker, etc.) this gets tricky, since you need to ensure `load_dotenv()` is called before any code which looks at `os.environ`.
- After attempting to get `python-dotenv` working for me, I gave `decouple` a shot. It’s much better: you use its `config` function to extract variables from the environment. That function ensures that `.env` is loaded before looking at your local `os.environ`. Use this package instead.
- By default, Django does not set up your `settings.py` to pull from the environment. You need to do this manually. I’ve included some snippets below.
- After getting decouple in place, you’ll probably want separate configurations for different environments. The best way to do this is to set `DJANGO_SETTINGS_MODULE` to point to a completely separate configuration file for each environment (see the sketch after this list).
  - In your `toml` you can set `[tool.pytest.ini_options] DJANGO_SETTINGS_MODULE = "app.settings.test"` to force a different environment for testing.
  - In production, you’ll set `DJANGO_SETTINGS_MODULE` to `app.settings.production` in the docker or heroku environment.
  - For all other environments, you’ll set `DJANGO_SETTINGS_MODULE` to `app.settings.development` in your `manage.py`.
  - In each of these files (`app/settings/development.py`, `app/settings/test.py`, etc.) you’ll `from .application import *` and store all common configuration in `app/settings/application.py`.
  - Here’s a working example.
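Here’s a minimal sketch of that layout, using the `app.settings.*` module names from above (the specific settings values are illustrative):

```python
# app/settings/application.py: common config, pulled from the environment
from decouple import config

SECRET_KEY = config("DJANGO_SECRET_KEY")
DEBUG = False

# app/settings/development.py: one small file per environment
from .application import *

DEBUG = True

# manage.py: default to development settings when nothing else is set
import os

os.environ.setdefault("DJANGO_SETTINGS_MODULE", "app.settings.development")
```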
Here’s how to configure django cache and celery to work with redis:
```python
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": config("REDIS_URL"),
        "OPTIONS": {
            "CLIENT_CLASS": "django_redis.client.DefaultClient",
        },
    }
}
```
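The celery half is just a couple of settings pointing at the same redis instance. Note that these `CELERY_*` names assume celery is configured with `namespace="CELERY"`, as in the sketch in the celery section below:

```python
# settings.py: celery picks these up via its django settings namespace
CELERY_BROKER_URL = config("REDIS_URL")
CELERY_RESULT_BACKEND = config("REDIS_URL")
```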
Here’s how to use `dj_database_url` with `decouple`:

```python
DATABASES = {"default": dj_database_url.parse(config("DATABASE_URL"))}
```
Job management using Celery
Django does not come with a job queue. Celery is the most popular job queue library out there and requires a broker (I used redis). It looks like it requires a decent amount of configuration, but I chose to use it anyway to understand how it compared to Sidekiq/Resque/ActiveJob/Oban/etc.
- `poetry add celery --allow-prereleases` (I needed a prerelease to work with the version of `click` I was using).
- If you are using redis as the broker (easier for me, since I already had it installed and running), you’ll need to `poetry add redis`.
- Celery does not use `manage.py`, so it would not load the `.env` file. I needed to manually run `dotenv_load()` at the top of my celery config. I discovered that this needed to be conditionally loaded for prod, at which point I discovered that `decouple` is a much better package for managing configuration.
- I put my celery tasks within the `users` application as `tasks.py`. You can specify a dot-path to the celery config via the CLI: `celery -A users.tasks worker --loglevel=INFO`
- You can configure celery to store results. If you do, you are responsible for clearing out results. They do not expire automatically.
- Celery has a built-in cron scheduler. Very nice! There’s even a nice `-B` option for running the scheduler within a single worker process (not recommended for prod, but nice for development).
- When I tried to access django models, I got some weird errors. There’s a django-specific setup process you need to run through: `DJANGO_SETTINGS_MODULE` needs to be set, just like in `manage.py`, and you can’t import django-specific modules at the top of the celery config file (see the sketch after this list).
- Celery is threaded by default. If your code is not thread safe, you’ll need to set `--concurrency=1`.
- By default, tasks do not run inline. If you want to set up an integration test for your tasks, you need to either (a) run tasks in eager mode (not recommended) or (b) set up a worker thread to run tasks for you during your tests.
- Eager mode is not recommended for testing, since it doesn’t simulate the production environment as closely. However, running a worker thread introduces another set of issues (like database cleanup not working properly).
- There’s no real downside to using `@shared_task` instead of `@app.task`. It’s easier to do this from the start: less refactoring to do when your application grows.
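Here’s a sketch of how the `users/tasks.py` wiring described above could look (module paths and the task body are illustrative, not my exact code):

```python
# users/tasks.py
import os

from celery import Celery, shared_task

# set DJANGO_SETTINGS_MODULE before anything django-specific runs,
# just like manage.py does
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "app.settings.development")

app = Celery("users")
# pull CELERY_* settings (broker, result backend) from django's settings
app.config_from_object("django.conf:settings", namespace="CELERY")


@shared_task
def run_purchase_for_user(user_id):
    # import django models inside the task, not at the top of the file
    from users.models import User

    user = User.objects.get(id=user_id)
    # ... kick off the index purchase for this user
```

Run it with `celery -A users.tasks worker --loglevel=INFO`, adding `-B` in development to run the scheduler in the same process.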
Testing
Some more learnings about working with `pytest` & `vcr` in combination with `django`:
- Database cleaning is done automatically for you via `@pytest.mark.django_db` at the top of your test class. This is great: no need to pull in a separate database cleaner.
- To be able to run `pytest` tests which rely on django models/configuration, you need the `pytest-django` extension.
- You can stick any config that would go in `pytest.ini` in your toml file under `[tool.pytest.ini_options]`.
- You need to set up a separate config for your database to ensure it doesn’t use the same one as your development environment. The easiest way to do this is to add `DJANGO_SETTINGS_MODULE = "yourapp.settings.test"` to your `toml` file and then override the database setup in the `yourapp/settings/test.py` file.
- You can use pytest fixtures to implement ruby-style around functions (see the sketch after this list).
- Redis/django cache is not cleared automatically between test runs. You can do this manually via `django.core.cache.cache.clear()`.
- In a scenario where you memoize/cache a global function that isn’t tied to a class, you may need to clear the cache to avoid global state causing indeterminate test results. You can do this for a single function via its `cache_clear()` method, or identify all functions with an lru cache and clear each of them.
- Django has a test runner (`python manage.py test`). It seems very different (it doesn’t support fixtures), and I ran into strange compatibility issues when using it. Use `pytest` instead.
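Here’s a sketch of these pieces together, assuming `pytest-django` is installed and reusing the hypothetical `User` model from the ORM sketch above:

```python
import pytest
from django.core.cache import cache

from users.models import User


# an autouse fixture gives you ruby-style "around" behavior: everything
# before the yield runs before each test, everything after runs after it
@pytest.fixture(autouse=True)
def clear_cache_between_tests():
    cache.clear()
    yield
    cache.clear()


@pytest.mark.django_db
def test_create_user():
    user = User.objects.create(name="Ada")
    assert User.objects.get(id=user.id) == user
```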
My thoughts on Django
I continue to be impressed with the python ecosystem. The dev tooling (linting, repls, type checking, formatting, etc.) is robust, and there are reasonably well-written and maintained packages for everything I needed. Most packages seem better maintained than their ruby equivalents. Only once did I have to dive into a package and hack in a change I needed. That’s pretty impressive, especially since the complexity of this application grew a lot more than I expected.
Working with python is just fun and fast (two things that are very important to me!). It’s a similar level of fun to ruby, but the language is better designed and therefore easier to read. You can tell the ecosystem has more throughput: more developers use the various packages, so more configuration options exist and more bugs have been worked out. This increases dev velocity, which matters a ton for a small side project and even more for a small startup. I don’t see a reason why I’d use ruby if I’m not building a rails-style web application.
Rails is ruby’s killer app. It’s better than Django across several dimensions:
- Better defaults.
- Multiple environments supported out of the box.
- Expansive batteries-included components. Job queuing, asset management, web workers, incoming/outgoing email processing, etc. This is the biggest gap in my mind: it takes a lot more effort & decisions to get all of these components working.
- Since django takes a ‘bring your own application components’ approach, you don’t get the benefit of large companies like Shopify, GitHub, etc. using these components and working out all of the bugs for you.
The Django way seems to be a very slim feature set that can be easily augmented with additional packages. Generally, I like unix-style single-responsibility tooling, but in my experience the integration and maintenance cost of adding tens of packages is very high. I want my web framework to do a lot for me. Yes, I’m biased since I’m used to rails, but I do think this approach is just better for rapid application development.
This was a super fun project. Definitely learned to love python and appreciate the Django ecosystem.
What I’m missing
There were some things I missed from other languages, although the list is pretty short and nitpicky:
- Source code references within docs. I love this about the ruby/elixir documentation: as you are looking at the docs for a method, you can reveal the source code for that method. It was painful to (a) jump into an ipython session, (b) import the module, and (c) run `?? module.reference` to view the source code.
- Package documentation in Dash
- More & better defaults in django setup.
- Improved stdlib map-reduce. If you can’t fit your data transformation into a comprehension, it’s painful to write and read. You end up writing for loops and appending to arrays.
- Format code references in the `path/to/file.py:line:col` format for easy click-to-open support in various editors. This drove me nuts when debugging stack traces.
- Improved TypedDict support. It seems this is a relatively new feature, and it shows. TypedDicts are frustrating to work with.
Open Questions
I hope to find an excuse to dig a bit more into the python ecosystem, specifically to learn the ML side of things. Here are some questions I still had at the end of the project:
- Does numpy/pandas eliminate data manipulation pain? My biggest gripe with python is the lack of chained data manipulation operators like ruby/elixir.
- How does the ML/AI/data science stuff work? This was one of my primary motivations for brushing up on my python skills and I’d love to deeply explore this.
- How does async/await work in python?
- How does asset management / frontend work in django?
Debugging asdf plugin issues
Although unrelated to this post, I had to debug some issues with an asdf plugin. Here’s how to do this:
- Clone the asdf plugin repo locally: `git clone https://github.com/asdf-community/asdf-poetry ~/Projects/asdf-poetry`
- Remove the existing version of the plugin: `cd ~/.asdf/plugins && rm -rf poetry`
- Symlink the repo you cloned: `ln -s ~/Projects/asdf-poetry poetry`
Now all commands hitting the poetry plugin will use your custom local copy.