Cognites guide to code quality in Python - Part 1

2 years ago
22 April 2022
0 replies
206 views

Userlevel 4

Håkon V. Treider
Practitioner
56 replies

Introduction

In this post we share some of our internal material aimed towards solution builders, such as data scientists, who want to develop their ability to develop high quality solutions by creating more reliable, maintainable and readable code. This is the first part of 2.

Motivation

High code quality is easy to recognize but can be very hard to describe concretely. The assumed benefits are easier maintainability, modifiability, and more. While code style, like formatting, can be a matter of different taste, most parties agree that other code practices that fall under the umbrella term “anti-patterns” should be avoided. To stop endless formatting discussions and the like, having (and adhering to the industry) standard makes reading and understanding code across repositories easier.

What this guide is not

This guide will not tackle the topic of “how to set up a Python project” the right way . Please let us know in the comments if you would like us to share more of our experience in this area.

...before we start

Read the Zen of Python (or run python -m this in your terminal window).

1. Avoid well-known anti-patterns

1.1 EAFP vs LBYL

The Python community favors an EAFP (Easier to Ask for Forgiveness than Permission) coding style. What does this mean in practice? This principle goes hand-in-hand with using try-except blocks (don’t worry about performance, they are incredibly efficient if no exceptions are raised).

Inserting data into a time series. With the try-except pattern we can, on average, make do with half the number of API-calls as opposed to the LBYL (Look Before You Leap) pattern:

# LBYL: Check before insert (2 API-calls on average):

xid = "my-ts-external-id"

if not client.time_series.retrieve(external_id=xid):

    client.time_series.create(...)

client.datapoints.insert(...)

 

# EAFP: Insert directly (1 API-call on average):

try:

    client.datapoints.insert(...)

except CogniteNotFoundError:

    client.time_series.create(...)

    client.datapoints.insert(...)

1.2 Race conditions

See: Wikipedia:TOCTOU

Guido van Rossum on Twitter

Any time you need to check for “something” that can change (or mutate state) between the time-of-check and the time-of-your-future-action(s) can and will break. Many examples fall into this category.

Checking if container object (like a dictionary) has a specific value, before using it. Here you don’t need a try-except as most such objects have a .get(key, fallback) or similar. Note however, you should never use this pattern unless you are expecting values to be missing - as this will hide potential KeyErrors.

# Example 1 (LBYL):

if os.path.exists(file):

    os.remove(file)  # Race condition: Might throw error!

# Example 2 (LBYL):

if key in dct:

    value = dct[key]

else:

    value = None

# Example 3 (LBYL):

if hasattr(obj, "external_id"):

    xid = obj.external_id

else:

    xid = None

     

# Example 1 (EAFP):

try:

    os.remove(file)

except FileNotFoundError:

    pass

# Example 2 (EAFP):

try:

    value = dct[key]

except KeyError:

    value = None

# Example 3 (EAFP, recommended):

try:

    xid = obj.external_id

except AttributeError:

    xid = None

 

## One-liner solutions:

# Example 1, recommended solution:

Path(file).unlink(missing_ok=False)

# Example 2, recommended solution:

value = dct.get(key)

# Example 3, NOT recommended, use EAFP example above:

xid = getattr(obj, "external_id", None)

Let’s expand a bit on example 1 from above. After you do a check for existence, another process may get in your way; like moving it, changing its filename, or simply deleting it before you could move on to your next command. Try to use idempotent functions as much as possible (translated: no matter how many times you call the operation, the result will be the same).

from pathlib import Path

 

file = Path("myfile.txt")

file.unlink(missing_ok=False)

1.3 My favorite variable naming!

If we are to believe Martin Fowler, there are two hard problems in computer science: cache invalidation and naming things.

Luckily for us, we have some guidelines to follow: PEP8 Naming Conventions. But wait, that doesn't tell us anything about the “content” of the variable?! Should we write out a_long_descriptive_variable_name, or is shrt_descr_varname better? Looking at the standard library, we can observe that the latter is the defacto standard: readable abbreviations. Another point about dynamic languages like Python is that you might want to dip your toes into “a dialect” of Hungarian notation from time to time. Unless obvious from context, adding type info in the variable name can good. A list of integers can be int_lst, and a dictionary of numpy arrays can be arr_dct. Using “type hints”, annotations, can make this even more evident. See the below section on this.

Do not overwrite built-ins (don’t use list or id as your variable name).
Favor read-time convenience over write-time convenience.
Be consistent! For example, a function name prefix for all functions in your codebase that can “retrieve_” something should not be mixed with “get_” or “fetch_”. In addition, this helps your IDE provide more useful suggestions for you!
Adhere to common practices! For example, df should be a pandas DataFrame! (at least in the realm of data science ).

To finish off, here is an excellent quote from PEP8 that was linked above, but you probably didn't click (wink) :

Naming Conventions

1.4 Using the coolest new syntax/stuff you just discovered - or learned - in your code…

…instead of seeing what code pattern fits the problem the best, just because you are so, so, so excited to show off how awesome your new discoveries are! One example could be using map and filter to make a readable list comprehension, unreadable, or by writing a context manager class from scratch by implementing magic methods __enter__ and __exit__ instead of just using the decorator @contextlib.contextmanager (not saying the first is never needed!). Remember the Zen: Simple is better than complex and readability counts!

def filter_on_name(ts):

    return "IA_29HS3001" in ts.name

     

# ...or to maximize obfuscation:

import operator as op

from toolz.functoolz import compose, curry, flip

 

filter_on_name = compose(  # Seriously, don't!

  curry(flip(op.contains))("IA_29HS3001"), op.attrgetter("name")

)

1.5 Using star imports

If you are unfamiliar with the term, it refers to this import pattern: from my_module import *, which imports everything from the given module or package into the global namespace. It has three main problems:

You (and your linter) might lose track of where functions are coming from.
You might overwrite earlier imports unknowingly (e.g. from math import *, and from numpy import *, the last one survives!)
You will probably import a ton of stuff that you are not going to use anyway (huuuge waste, eh?)

On the subject of imports, try to avoid the following pattern: Whenever the functions (or constants or classes or…) you’re importing have a non-specific name, like get, compute, generate, process, or similar, importing them directly instead of the module they are from. What’s the problem, you might think, with something like this innocent-looking statement: from requests import get?! It is for readability - or should we say, understandability - for your future self, and others that might need to parse your code. Seeing a function deep inside the code saying get(x) versus requests.get(x) are two very different things. Only the latter is crystal clear about its context; where it's coming from, from which what it does can probably be understood much more quickly. Here is an extreme, but real, example:

from scipy.stats import t

(...)  # loads of code

num = t(2.74).pdf(0.01)  # wtf is t?! ...and pdf?!

 

# Versus doing:

import scipy.stats

num = scipy.stats.t(2.74).pdf(0.01)

1.6 Using a class when a simple function would suffice

Although classes in Python are flexible and awesome, make sure you can explain the benefit of using a class over a function, before writing it. A class with an __init__ and a single other method is likely not worthy of being a class! In essence, a class is a way to make a “logical grouping” of data and functions meant to be used on/with that data, into its own nice little “capsule”. No real connection to be made? Throw the functions into their own module instead - don’t use a class just to collect e.g. utilities! Remember, you may import the module and use the dot syntax to look them up individually!

1.7 Designing for testability

A excellent thought to have in mind is to write code that is easy to test. It is such a simple, yet powerful idea that might lead you towards better code! So how do you do that? The common mantra is to write small, tidy functions (and classes) with few, or preferably, a single responsibility. In reality though, code will always be messy to some degree. One pattern you should always keep at the back of your mind is dependency injection: It means that any piece of code you produce should be provided everything it needs (in order to run/execute), as opposed to creating these resources (or whatever) on its own. To bring it back to testability: this code now can easily be tested, by for example passing one or more mocked objects, instead of having to resort to advanced testing techniques. You can read more about it here: https://python-dependency-injector.ets-labs.org/introduction/di_in_python.html (PS: you may also want to search for the SOLID principles of software engineering!).

An overall example:

# WITHOUT dependency injection:

class Gateway:

    def __init___(self):

        db_connection_str = os.getenv("DB_CONNECTION")

        self._database = DataBase(db_connection_str)

        api_key = os.getenv("API_KEY")

        self.client = QueryClient(api_key, db=self._database)

 

    def run_query(self, query):

        self.validate_query(query)

        return self.client.call(query)

 

# WITH dependency injection (`run_query` unchanged):

class Gateway:

    def __init___(self, client: QueryClient, database: DataBase):

        self._database = database

        self.client = client

1.8 Returning different types from a function

Although good, valid use cases exist for this (json.loads(obj) is a prime example), as a rule of thumb, it's a red flag when the output type might vary depending on the inputs. Try to return the most restricted type possible. Annotating your functions and adding a static type checker (like mypy) to your project setup is recommended! If your type hints start to outgrow the code, it might be a good hint (pun intended) to refactor the code!

Note: Excluded from this rule are functions that, for example, accept an Iterable, and return the same type as the given input (duck-typing at its finest!).

Check first if the function secretly has two (or more) responsibilities and if so, split them apart! If that is not the case, see if a different pattern is possible with None as the only other valid return value. Here is an example of a bad function:

# Don't do this kids
def calculate(*args):
    series = pd.Series(mathy_function(*args))
    if not series.empty:
        return series
    return []  # <-- inconsistent type

1.9 Not using type annotations

Python's documentation starts with a disclaimer, which is worth repeating here: The Python runtime does not enforce function and variable type annotations. They can be used by third-party tools such as type checkers, IDEs, linters, etc.

The main benefit is increased visibility into the code’s expectations for future developers (yourself included) and users of the code. Linters like mypy may also point out inconsistencies in your code, like "expected type to be pathlib.Path, got str".

Tip: To find out what type mypy infers, use reveal_type:

reveal_type(1023)  # -> Revealed type is "builtins.int"

Another tip: mypy cheat sheet: https://mypy.readthedocs.io/en/stable/cheat_sheet_py3.html

1.10 Returning a different number of elements from a function

This anti-pattern has most of the same reasoning as the previous; there are valid use cases, but the general rule is to avoid.

# Don't do this either, kids

def my_calculator(foo, bar):

    res = do_something(foo) + do_something(bar)

    if res > 0:

        return True, res**2, "magic"

    return False  # <-- inconsistent number of return args

The way to resolve this would be to pack together the necessary return values into an object, like a dataclass, dict, namedtuple or similar, then return that one instead.

1.11 Shadowing variable names in an outer scope

As we learned from the Zen, namespaces are one honking great idea, but you should follow some principles to avoid confusion and unintended consequences. There are 4 different scopes: Local, Enclosing, Global and Built-in. (I recommend this great guide: https://realpython.com/python-namespaces-scope/.) This is also the lookup order, i.e. the search order your Python interpreter goes through when it needs to find the variable you are using/referring to in your code. If all four come up empty, you get a NameError:

image-20220119-112251.png?version=1&modificationDate=1642591375662&cacheVersion=1&api=v2&width=170&height=170

This anti-pattern concerns the use of having the same variable names in (e.g.) local scope and an outer scope “shadowing” because it’s easy to end up in a situation (after a refactor for instance) where the wrong variable is accidentally used - the one in outer scope - instead of an error being raised.

# Don't, just don't!

pi = math.pi

def find_connection_latency(conn):

    # Bad variable name was "fixed" from 'pi' to 'ping':

    ping = timer(conn.connect())

    ...

    return pi  # Oops, forgot this one

…the good thing now is that the ping is super stable

1.12 Overwriting built-in names

Just a specific example of the above rule, do not use the “reserved” (but overwritable) built-in names, for example, id = 42 or list = [1, 2, 3]. Fun fact: In Python2, you could actually redefine True and False. That’s a honking bad idea!

You can use https://pypi.org/project/flake8-builtins/ to automatically enforce this!

1.13 Use of the `global` or `nonlocal` keywords.

Using these keywords allows you to modify variables outside of your current scope, not just use them. Although simple uses like having a counter or timer (or similar) are probably okay; it is still a red flag indicating that you should (probably) refactor your code away from this pattern, for example using classes.

1.14 Using mutable default arguments in functions or as class variables

In Python, a function is just an object, and so it has stored inside the default argument values (if any). The problems arise when these are mutated by the function themselves - as that will lead to the function retaining state between calls. Here’s a banal example:

def append_numbers(*nums, lst=[]):

    lst.extend(nums)

    return lst

     

>>> my_numbers = list(range(3))

>>> append_numbers(40, 42, 44, lst=my_numbers)

[0, 1, 2, 40, 42, 44]  # Expected

>>> append_numbers(40, 42, 44)

[40, 42, 44]  # Expected

>>> append_numbers(10, 11)

[40, 42, 44, 10, 11]  # <-- Not expected!

As we can see, when not passing a list to append to, the function used the default one - and repeated calls now mutate this list in place. We may inspect the function object, and to no surprise, we see the list above:

>>> append_numbers.__kwdefaults__

{'lst': [40, 42, 44, 10, 11]}

How to solve this? Use None to indicate missing/not passed instead. The code inside the function is executed for every call, as opposed to the function signature, which is only evaluated once:

def append_numbers(*nums, lst=None):

    lst = lst or []

    lst.extend(nums)  # Now we Gucci

    return lst

1.15 Hiding or ignoring errors

Python has several mechanisms to avoid “critical failure”. You should use these features with caution to avoid hiding errors that would otherwise bubble up to the surface, revealing bugs. When catching errors (or exceptions), try always to be specific, never do the “bare except” to catch “anything and everything”:

# Another one of those "plz never do":

try:

   might_fail_spectacularly()

except:  # This is the so-called "bare except"

    pass

The “bare except” is particularly bad since it also catches exceptions inheriting from BaseException like KeyboardInterrupt and SystemExit. Doing except Exception is slightly better, but picking one or more specific errors is best!

You can apply the same logic to container object methods that look up something that does not raise on a missing element (e.g. dict.get). Use sparingly and only when you expect missing entries!

In the opposite case, when you want to make it known to the (code) reader that the error just can be ignored, you can be clear about it:

with contextlib.suppress(ThatParticularException):

    pass  # Beautiful!

1.16 Excessive indentation

This is typically a good hint that your function is too long and should be split up. Also, check for exit-clauses that don't need the else, and you can remove them and de-dent the code. Another idea is to flip the order if one clause is much longer than the other(s):

# Rather than this:

if a:

   (...)  # 500 lines of code

else:

    return None

     

# Do the flipppppp:

if not a:

    return None

(...)  # 500 lines, not even indented!

Another pattern you can use to reduce the complexity is the “guard clause” pattern (cases that should result in an immediate exit/return from the function). You can read more about it here: https://deviq.com/design-patterns/guard-clause

1.17 Creating an excessive amount of custom exceptions

In general, try to stick with the built-in exceptions like ValueError, TypeError, etc., and only create new custom exceptions when you may argue what increase in value they give you. One example is that you are creating library code and want the users to be able to catch specific errors you might produce, or you need to carry additional information with the raised exception (for example the status code for a failed HTTP request).

1.18 Missing understanding of `Iterable` objects

Making an object iterable in Python is so easy you should expect all container-like objects (meaning they store some collection of other items) to have implemented this protocol. What am I trying to say: you do not need to convert your object into a list before iterating through it!

Here are a few repeat offenders:

# For a generic loop, what do you iterate over?

for element in iterable:

    print(element)

 

# ITERABLE    : ELEMENT

# -------------------------------------------

# built-in str : one character at the time

# built-in set : the elements, assume random order

# built-in dict   : the keys (>=PY3.6 in insertion order)

# numpy ndarray   : the values (for multidimensional data, you iterate the first dimension,

#               (i.e. you get numpy ndarray slices))

# pandas DataFrame: column names as string

# pandas Series   : the values

This is all for now! If you want more, keep an eye open for part 2.

0 replies

Be the first to reply!