Introduction
In this post we continue to share some of our internal material aimed at solution builders, such as data scientists, who want to improve their ability to develop high-quality solutions by writing more reliable, maintainable and readable code. This is part two of two.
Motivation
High code quality is easy to recognize but can be very hard to describe concretely. The assumed benefits are easier maintainability, modifiability, and more. While code style, like formatting, can be a matter of taste, most parties agree that other code practices that fall under the umbrella term “anti-patterns” should be avoided. To stop endless formatting discussions and the like, having (and adhering to) an industry standard makes reading and understanding code across repositories easier.
What this guide is not
This guide will not tackle the topic of “how to set up a Python project” the right way. Please let us know in the comments if you would like us to share more of our experience in this area.
…before we start
Read the Zen of Python (or run python -m this in your terminal window).
2. How to use comments, wisely
It is surprising how easily - and how quickly - comments tend to become outdated. Here are a few tips on when and how to use comments:
- Inline comments should be used when the code does something unexpected or it is not clear why it is needed at all. Always try to avoid adding comments that say more or less the same as the code.
Example of a, quite frankly, useless comment that I’ve seen more times than I can count:

# We now sort the index:
df = df.sort_index()
Example of a good comment:
import matplotlib
# Need to set a compatible backend because server is running Ubuntu:
matplotlib.use('Agg')
- The DRY principle (Don’t Repeat Yourself) also applies here!
- If you must, let’s say in the case of a function docstring that explains all the parameters (and repeats the type annotations), at least set up your workflow or pre-commits to use an automatic tool like darglint to ensure it stays up-to-date with any future changes.
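To make this concrete, here is a small, made-up example of a Google-style docstring (one of the formats darglint can check), where the documented arguments have to keep matching the signature:

from typing import List

def resample_signal(values: List[float], factor: int) -> List[float]:
    """Downsample a signal by keeping every factor-th value.

    Args:
        values: The raw signal values.
        factor: Keep every factor-th element (must be >= 1).

    Returns:
        The downsampled signal.
    """
    return values[::factor]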
3. How to use logging / print(s)
Storing logs from solutions running in production can quickly become quite expensive if we aren’t careful. Here are a few tips along the way!
Use the different logging levels actively: Having loads of DEBUG and INFO statements is fine, as long as the log level is set higher (e.g. WARNING) when running in production (as opposed to locally on your machine), but…
During the development/exploration phase: using output-generating statements (like print/log) to figure out what happens - and what goes wrong - in your code is an okay way of working, albeit inefficient. It is often a sign of a coding setup missing a good debugging tool with introspection and breakpoint capabilities. IDEs (Integrated Development Environments) like VSCode and PyCharm ship with good tools for this purpose. Learn to use them - this might save you time and sweat in the future!
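To illustrate the point about log levels, here is a minimal sketch using the standard logging module (the environment variable name LOG_LEVEL is just an assumption for this example):

import logging
import os

# Default to WARNING (suitable for production); override locally, e.g. LOG_LEVEL=DEBUG:
logging.basicConfig(level=os.environ.get("LOG_LEVEL", "WARNING"))
logger = logging.getLogger(__name__)

logger.debug("Raw dataframe shape: %s", (1000, 42))  # Only emitted when the level is DEBUG
logger.warning("Missing values found in column 'speed'")  # Emitted even with the WARNING default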
4. Use of unsafe functions like exec or eval
Python is an incredibly flexible language, but that does not mean you have an actual need for all that flexibility. The times you have a real need for functions like exec and eval (which execute arbitrary code, from a string, at runtime) are few and far between. Most likely, your need can be solved without them! Some well-known “dangerous” functions are yaml.load (there exists a safe_load!), pickle.load, input (in Python 2), and relying on asserts triggering (they are omitted if Python is run with -O for optimized bytecode), to name a few. Having bandit as part of your workflow checks (and pre-commits) will notify you of potential security concerns like these!
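To make the yaml.load point concrete, here is a minimal sketch (assuming the third-party PyYAML package is installed) of preferring safe_load, which only constructs plain Python types:

import yaml  # Third-party package: PyYAML

raw = "retries: 3\ntimeout_s: 30.0"
# safe_load only builds plain Python objects (dicts, lists, strings, numbers, ...),
# while load with an unsafe loader can be tricked into constructing arbitrary objects:
config = yaml.safe_load(raw)
print(config)  # {'retries': 3, 'timeout_s': 30.0}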
Here are a few common scenarios that do not require the use of unsafe functions:
- You only know the attribute name (or variable name) at runtime and thus can't write the code upfront. Wrong! Here’s how:
from dataclasses import dataclass
from typing import Tuple

@dataclass
class CarInfo:
    color: Tuple[int, int, int]
    max_speed: float
    is_electric: bool
    is_cool: bool
    (...)

my_car = CarInfo(...)
attrs = requested_car_properties(...)  # E.g.: ["color", "max_speed"]
info = [getattr(my_car, a) for a in attrs]
Similarly, setattr may be used to set or update properties/variables on an object (and delattr also exists).
- You need to look up a variable in the current namespace. Remember that “everything in Python” is a dictionary (well, almost…), and you can get the variable-name-to-value mapping by using globals() or locals(), depending on which namespace you are after:

>>> aaa = 1337
>>> from typing import *  # See the asterisk (pun intended) on doing star imports!
>>> globals()
{
    'aaa': 1337,
    'Any': typing.Any,
    'Callable': typing.Callable,
    ...
    'Generator': typing.Generator,
    'NoReturn': typing.NoReturn,
}
Fun fact: Python’s namedtuple implementation uses exec (source: https://bugs.python.org/issue3974):
# Don't believe me?!
from inspect import getsource
from collections import namedtuple
print(getsource(namedtuple))
5. Don’t reinvent existing or built-in functionality
Python ships with a rich standard library! Take a look through the excellent documentation or do a quick Google search before implementing “it” yourself. Assuming your need isn't very fringe, someone else has probably done it already (maybe even better)!
5.1 I need a temporary file or directory!
Let me import uuid to generate a unique string… then… No! Just use the built-in module tempfile:
from tempfile import TemporaryDirectory

# The directory (and everything in it) is removed automatically on exit:
with TemporaryDirectory() as tmp_dir:
    pass
5.2 I need to represent paths on the filesystem
…and decided to use strings. Again, no! Have you thought about all the differences between operating systems, for example!? Either use the built-in module pathlib or os.path.
from pathlib import Path
cur_dir = Path()
src_file = cur_dir / "src" / "main.py"
with src_file.open() as f:
    lines = f.readlines()
# Check if path exists (in this case, the file):
>>> src_file.exists()
False
# Check if path is a directory (there's also .is_file()):
>>> cur_dir.is_dir()
True
# Get file suffix:
>>> src_file.suffix
'.py'
5.3 I need a class to store data
Using classes to store some data (according to a schema) is such an everyday use case that dataclasses made it into the standard library. They remove the need to write a lot of boilerplate code that can be automated. The Python package pydantic implements a drop-in replacement for these that also does type validation at runtime. Let us know in the comments if you would like us to share more examples in this area.
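As a small illustration (the Measurement class and its fields are made up for this post), here is a plain dataclass, with a comment on how the drop-in decorator from pydantic.dataclasses would add runtime validation:

from dataclasses import dataclass

@dataclass
class Measurement:
    sensor_id: str
    value: float

m = Measurement(sensor_id="temp-01", value=21.3)
print(m)  # Measurement(sensor_id='temp-01', value=21.3)

# Assumption: with pydantic installed, swapping the decorator adds runtime type validation:
# from pydantic.dataclasses import dataclass
# Measurement(sensor_id="temp-01", value="not-a-float")  # would raise a validation error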
6. A note on data structures
Great code uses data structures that fit the problem. Although this can be pretty hard to verify in a PR, there are a few signs/bad patterns to look out for that may ruin performance while still having super easy fixes:
- Repeated lookups or “a in b” checks: If b in this example is a list, each check might need to run through the entire list, so the total runtime quickly grows quadratically as the number of checks increases. Solution: change b to a hash-table-based data structure like set, frozenset, or the keys of a dictionary (dict_keys) - see the sketch after this list.
- Appending rows to a pandas DataFrame inside a loop. Unlike the built-in list in Python, a pandas DataFrame stores its data in fixed-size arrays (thanks numpy!). That means any operation wanting to change the length of the array will trigger a full reallocation. This. Is. Slow. Solution: append the rows to a list and, after the loop is done, do a single pd.concat(row_lst) - illustrated below.
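A short, self-contained sketch of both fixes (the toy data is made up for illustration):

import pandas as pd

# Fix 1: membership checks against a set are O(1) on average,
# whereas "x in some_list" scans the list on every check:
allowed_ids = {"a", "b", "c"}
readings = [("a", 1.0), ("z", 2.0), ("b", 3.0)]
relevant = [(rid, value) for rid, value in readings if rid in allowed_ids]

# Fix 2: collect the rows in a list and concatenate once after the loop,
# instead of growing the DataFrame row by row:
row_lst = [pd.DataFrame({"id": [rid], "value": [value]}) for rid, value in relevant]
df = pd.concat(row_lst, ignore_index=True)
print(df)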
7. Take great care when dealing with (date)time, timezones etc.
Python ships with the built-in date and time library datetime. Contrary to other languages, the decision was made that naive datetimes (meaning “no timezone information”) should be interpreted as “local time”. On the positive side, this means any user, independent of location, daylight saving time setting and timezone, will see datetime.now() match their wrist-watch/calendar.
Warning: This module also has some functions/methods that are prefixed with utc. Be especially careful about these, as they e.g. parse assuming UTC, but then leave the datetime as a naive object... so future evaluations will still assume it to be local time.
Do not: The Python Cognite SDK has decided to interpret naive datetimes as UTC, which will probably be changed in a future (breaking-change) version. Thus, please never pass naive datetime objects to the SDK (future-proofing!).
Solution: What we suggest you do instead is to use a third-party library like arrow that uses timezone-aware objects by default! That way no method has to guess how to interpret the object (“local” or “UTC”?!) and you will always get correct behavior! Code to illustrate:
>>> from datetime import datetime
>>> str(datetime.now()) # No timezone info :(
'2021-10-07 16:07:46.220471'
>>> str(datetime.utcnow()) # Still no timezone info :(
'2021-10-07 14:07:46.221051'
# Seconds since epoch:
>>> datetime.now().timestamp()
1633615666.220471 # Always correct!
>>> datetime.utcnow().timestamp()
1633612066.220471 # Wrong! Except e.g. servers running in UTC
>>> import arrow
>>> now = arrow.now()
>>> str(now) # Hey look, timezone info:
'2021-10-07T16:07:46.221207+02:00'
>>> str(arrow.utcnow()) # Still timezone info:
'2021-10-07T14:07:46.221408+00:00'
>>> arrow.now().timestamp()
1633615666.221207 # Always correct!
>>> arrow.utcnow().timestamp()
1633615666.221207 # Also, always correct!
Additionally, you should never write any sort of exotic parsing of text into datetime on your own. Here is a long list of common non-intuitive datetime assumptions you may falsely believe are true - and that may ultimately crash your code.
Instead, use arrow.get (or datetime.strptime) and pass in the date string and the format. NOTE: Do not forget about the timezone!
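A minimal sketch (the date string and the Europe/Oslo timezone are just example values):

import arrow

# Parse a known format and attach the timezone explicitly:
ts = arrow.get("2021-10-07 16:07", "YYYY-MM-DD HH:mm", tzinfo="Europe/Oslo")
print(ts)            # 2021-10-07T16:07:00+02:00
print(ts.to("utc"))  # 2021-10-07T14:07:00+00:00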
Resources