Cognite's guide to code quality in Python - Part 2

  • 7 June 2022

Introduction 

In this post we continue to share some of our internal material aimed at solution builders, such as data scientists, who want to get better at building high-quality solutions by writing more reliable, maintainable, and readable code. This is the second of two parts.

Motivation

High code quality is easy to recognize but can be very hard to describe concretely. The assumed benefits include easier maintenance and modification, among others. While code style, such as formatting, can be a matter of taste, most parties agree that the practices falling under the umbrella term “anti-patterns” should be avoided. Having (and adhering to) an industry standard puts a stop to endless formatting discussions and makes code easier to read and understand across repositories.

What this guide is not

This guide will not tackle the topic of “how to set up a Python project” the right way™. Please let us know in the comments if you would like us to share more of our experience in this area.

…before we start

Read the Zen of Python (or run python -m this in your terminal window).

 

2. How to use comments, wisely

It is surprising how easily - and how quickly - comments tend to become outdated. Here are a few tips on when and how to use comments:

  1. Inline comments should be used when the code does something unexpected, or when it is not clear why the code is needed at all. Avoid comments that say more or less the same as the code.
    Example of a, quite frankly, useless comment that I’ve seen more times than I can count:

    # We now sort the index:
    df = df.sort_index()

    Example of a good comment:

    import matplotlib

    # Need to set a compatible backend because server is running Ubuntu:
    matplotlib.use('Agg')
  2. The DRY principle: Don't Repeat Yourself, also applies here!
  3. If you must repeat yourself, say in a function docstring that explains all the parameters (and repeats the type annotations), at least set up your workflow or pre-commit hooks to run an automatic tool like darglint so the docstring stays up to date with future changes.

3. How to use logging / print(s)

Storing logs from solutions running in production can quickly ramp up to become quite expensive if we aren’t careful. Here are a few tips along the way!

Use the different logging levels actively: Having loads of DEBUG and INFO statements is fine, as long as the log level is set higher (e.g. WARNING) when running in production (as opposed to locally on your machine), but…

During the development/exploration phase: using output-generating statements (like print or log calls) to figure out what happens, and what goes wrong, in your code is an okay way of working, albeit inefficient. It is often a sign that your setup is missing a good debugging tool with introspection and breakpoint capabilities. IDEs (integrated development environments) like VSCode and PyCharm ship with good tools for this purpose. Learn to use them; it might save you time and sweat in the future!
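As a minimal sketch of the "set the level per environment" idea above, the snippet below reads the log level from an environment variable and defaults to WARNING. The variable name LOG_LEVEL, the logger name, and the process function are assumptions made for illustration, not a Cognite convention.

```python
import logging
import os

# Assumption: the deployment sets LOG_LEVEL (hypothetical name); default to WARNING.
LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}
level = LEVELS.get(os.environ.get("LOG_LEVEL", "WARNING").upper(), logging.WARNING)

logging.basicConfig(format="%(asctime)s %(levelname)s %(name)s: %(message)s")
logger = logging.getLogger("my_solution")  # hypothetical logger name
logger.setLevel(level)


def process(values):
    """Return the mean of values, logging along the way."""
    # Passing arguments separately defers string formatting until the
    # message is actually going to be emitted.
    logger.debug("Processing %d values", len(values))
    if not values:
        logger.warning("Received an empty batch")
        return 0.0
    return sum(values) / len(values)
```

Running locally with LOG_LEVEL=DEBUG shows every message, while a production run without the variable only stores warnings and above.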

4. Use of unsafe functions like exec or eval

Python is an incredibly flexible language, but that does not mean you actually need all that flexibility. The times you have a real need for functions like exec and eval (which execute arbitrary code, from a string, at runtime) are few and far between. Most likely, your need can be met without them! Some other well-known “dangerous” constructs are yaml.load (there is a safe_load!), pickle.load, input (in Python 2), and relying on asserts triggering (they are stripped when Python is run with -O for optimized bytecode), to name a few. Having bandit as part of your workflow checks (and pre-commit hooks) will notify you of potential security concerns like these!
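Two of these concerns have simple standard-library answers, sketched below. Note that ast.literal_eval is not mentioned in the original text; it is shown here as one possible replacement for eval when the input is only supposed to be a Python literal. The function names are hypothetical.

```python
import ast


def parse_literal(text: str):
    """Parse a Python literal (numbers, strings, tuples, lists, dicts, ...)
    without executing any code."""
    return ast.literal_eval(text)


def require_positive(value: float) -> float:
    # An explicit check still runs under `python -O`, unlike an assert.
    if value <= 0:
        raise ValueError(f"Expected a positive number, got {value!r}")
    return value


print(parse_literal("{'color': (255, 0, 0), 'max_speed': 180.0}"))

try:
    parse_literal("__import__('os').getcwd()")  # a function call, not a literal
except ValueError as exc:
    print("Rejected:", exc)
```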

Here are a few common scenarios that do not require the use of unsafe functions:

  1. You only know the attribute name (or variable name) at runtime and thus can't write the code upfront. Wrong! Here’s how:

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass
    class CarInfo:
        color: Tuple[int, int, int]
        max_speed: float
        is_electric: bool
        is_cool: bool
        (...)

    my_car = CarInfo(...)
    attrs = requested_car_properties(...)  # E.g.: ["color", "max_speed"]
    info = [getattr(my_car, a) for a in attrs]

    Similarly, setattr may be used to set or update properties/variables on an object (and delattr also exists).

  2. You need to look up a variable in the current namespace. Remember that “everything in Python” is a dictionary (well, almost…), and you can get the variable-name-to-value-mapping by using globals() or locals(), depending on what namespace you are after:

    >>> aaa = 1337
    >>> from typing import *  # See the asterisk (pun intended) on doing star imports!
    >>> globals()
    {
        'aaa': 1337,
        'Any': typing.Any,
        'Callable': typing.Callable,
        ...
        'Generator': typing.Generator,
        'NoReturn': typing.NoReturn,
    }
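A small, self-contained sketch of the two scenarios above. The Car class and the lookup_local helper are hypothetical stand-ins; the namespace lookup uses locals() inside a function, and globals() works the same way for module-level names.

```python
from dataclasses import dataclass


@dataclass
class Car:  # hypothetical, simplified stand-in for CarInfo
    max_speed: float
    is_electric: bool


car = Car(max_speed=180.0, is_electric=False)

# Read attributes whose names are only known at runtime:
requested = ["max_speed", "is_electric"]
values = {name: getattr(car, name) for name in requested}

# Update an attribute by name:
setattr(car, "is_electric", True)

# getattr accepts a default for names that may not exist:
missing = getattr(car, "wing_span", None)


def lookup_local(name):
    """Look up a local variable by name instead of calling eval(name)."""
    aaa = 1337
    return locals()[name]


print(values, car.is_electric, missing, lookup_local("aaa"))
```

If the set of allowed names is known up front, an explicit dictionary mapping names to values or functions is usually clearer still than reaching into a namespace.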

Fun fact: Python’s own namedtuple implementation uses exec (or eval, depending on the Python version). See: https://bugs.python.org/issue3974

# Don't believe me?!
from inspect import getsource
from collections import namedtuple

print(getsource(namedtuple))


5. Don’t reinvent existing or built-in functionality

Python ships with a rich standard library! Take a look through the excellent documentation or do a quick Google search before implementing “it” yourself. Assuming your need isn't very fringe, someone else has probably done it already (maybe even better)!

5.1 I need a temporary file or directory!

Let me import uuid to generate a unique string…then…No! Just use the built-in module: tempfile

from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp_dir:
    pass
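To make the snippet above a bit more concrete, here is a sketch that writes a file inside the temporary directory and shows that everything is cleaned up when the with block exits. The file name is hypothetical.

```python
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp_dir:
    # tmp_dir is the path (a str) to a freshly created, uniquely named directory.
    report = Path(tmp_dir) / "report.txt"  # hypothetical file name
    report.write_text("hello")
    content = report.read_text()
    existed_inside = report.exists()

# The directory and everything in it is removed when the block exits.
exists_after = report.exists()
print(content, existed_inside, exists_after)
```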

5.2 I need to represent paths on the filesystem

and decided to use strings. Again, no! Have you thought about all the differences between operating systems, for example!? Either use the built-in module pathlib or os.path.

from pathlib import Path

cur_dir = Path()
src_file = cur_dir / "src" / "main.py"

with src_file.open() as f:
    lines = f.readlines()

# Check if path exists (in this case, the file):
>>> src_file.exists()
False

# Check if path is a directory (there's also .is_file()):
>>> cur_dir.is_dir()
True

# Get file suffix:
>>> src_file.suffix
'.py'

5.3 I need a class to store data

Using classes to store some data (according to a schema) is such an everyday use case that dataclasses made it into the standard library. They remove the need to write a lot of boilerplate code that can be automated. The Python package pydantic implements a drop-in replacement for these that also does type validation at runtime. Let us know in the comments if you would like us to share more examples in this area.
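A short sketch of what the standard-library dataclass decorator generates for you. The SensorReading class and its fields are made up for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class SensorReading:  # hypothetical example schema
    sensor_id: str
    value: float
    unit: str = "bar"
    # Mutable defaults must go through default_factory:
    tags: List[str] = field(default_factory=list)


a = SensorReading("pump-1", 4.2)
b = SensorReading("pump-1", 4.2)

print(a)          # __repr__ is generated
print(a == b)     # __eq__ compares field by field
print(asdict(a))  # conversion to a plain dict
```

Keep in mind that the standard-library version does not check the annotated types at runtime; SensorReading("pump-1", "not a number") is accepted without complaint, which is the gap pydantic fills.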

6. A note on data structures

Great code uses data structures that fit the problem. Although this can be pretty hard to verify in a PR, there are a few signs/bad patterns to look out for that may ruin performance while still having super easy fixes:

  1. Repeated lookups or “a in b” checks: If b in this example is a list, each check may have to scan the entire list, so the total runtime grows with the number of checks times the length of the list (quadratically, if both grow together). Solution: change b to a hash-table-based data structure like set, frozenset, or a dictionary (whose keys support the same fast lookup).

  2. Appending rows to a pandas DataFrame inside a loop. Unlike the built-in list in Python, a pandas DataFrame stores its data in fixed-size arrays (thanks numpy!). That means that any operation that changes the length of the array triggers a full reallocation. This. Is. Slow. Solution: Append the rows to a list and, after the loop is done, build the result in one go with pd.concat(row_lst) (or pd.DataFrame(row_lst) if the rows are plain dicts).
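The first pattern can be illustrated with the standard library alone. The sizes and names below are arbitrary, and the timings will vary from machine to machine, so no concrete numbers are claimed here.

```python
import timeit

n = 2_000
known_ids = list(range(n))
queries = list(range(0, 2 * n, 2))  # half of these are present, half are not


def count_hits(container, items):
    """Count how many items are found in container using `in`."""
    return sum(1 for item in items if item in container)


known_ids_set = set(known_ids)  # one-time conversion, then fast lookups

hits_list = count_hits(known_ids, queries)
hits_set = count_hits(known_ids_set, queries)

list_seconds = timeit.timeit(lambda: count_hits(known_ids, queries), number=1)
set_seconds = timeit.timeit(lambda: count_hits(known_ids_set, queries), number=1)
print(f"list: {list_seconds:.4f}s, set: {set_seconds:.4f}s")
```

Both versions return the same answer; only the container type, and with it the cost of each lookup, changes.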

7. Take great care when dealing with (date)time, timezones etc.

Python ships with the built-in date and time library datetime. Contrary to other languages, the decision was made that naive (meaning “no timezone information”) should be interpreted as “local time”. On the positive side, this means any user, independent of location, daylight saving time setting, and timezone, will see datetime.now() match their wristwatch/calendar.

Warning: This module also has some functions/methods that are prefixed with utc. Be especially careful with these: they, for example, parse assuming UTC but then return the datetime as a naive object, so later operations will still treat it as local time.

Do not: The Cognite Python SDK has decided to interpret naive datetimes as UTC, which will probably change in a future (breaking) version. Thus, please never pass naive datetime objects to the SDK (future-proofing!).

Solution: What we suggest you do instead is to use a third-party library like arrow that uses timezone-aware objects by default! That way no method has to guess how to interpret the value (“local” or “UTC”?!) and you will always get correct behavior! Code to illustrate:

>>> from datetime import datetime
>>> str(datetime.now())  # No timezone info :(
'2021-10-07 16:07:46.220471'
>>> str(datetime.utcnow())  # Still no timezone info :(
'2021-10-07 14:07:46.221051'

# Seconds since epoch:
>>> datetime.now().timestamp()
1633615666.220471  # Always correct!
>>> datetime.utcnow().timestamp()
1633608466.220471  # Wrong! Except e.g. servers running in UTC

>>> import arrow
>>> now = arrow.now()
>>> str(now)  # Hey look, timezone info:
'2021-10-07T16:07:46.221207+02:00'

>>> str(arrow.utcnow())  # Still timezone info:
'2021-10-07T14:07:46.221408+00:00'

>>> arrow.now().timestamp()
1633615666.221207  # Always correct!
>>> arrow.utcnow().timestamp()
1633615666.221207  # Also, always correct!
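If adding a dependency like arrow is not an option, the standard library can also produce timezone-aware objects, as long as you ask for them explicitly. This is a swapped-in alternative to the arrow approach above, not a replacement for that recommendation; the epoch value is the same instant as in the examples above.

```python
from datetime import datetime, timezone

epoch_seconds = 1633615666  # same instant as in the examples above

# Ask for an aware object explicitly instead of using the utc-prefixed helpers:
aware = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
print(aware.isoformat())  # 2021-10-07T14:07:46+00:00

# The aware equivalent of utcnow():
now_utc = datetime.now(timezone.utc)

# A naive object holding UTC wall-clock time (what the utc-prefixed helpers
# return) must be tagged with its timezone before it can be converted safely:
naive = aware.replace(tzinfo=None)
restored = naive.replace(tzinfo=timezone.utc)
print(restored.timestamp() == epoch_seconds)  # True
```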

Additionally, you should never write any sort of exotic parsing of text into datetime on your own. Here is a long list of common non-intuitive datetime assumptions you may falsely believe are true - and that may ultimately crash your code.

Instead, use arrow.get (or datetime.strptime) and pass in the date string and the format. NOTE: Do not forget about the timezone!
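A minimal sketch of the datetime.strptime variant, assuming the input string carries its UTC offset. The input string and format are made up for illustration; the %z directive is what keeps the timezone.

```python
from datetime import datetime, timedelta

raw = "2021-10-07 16:07:46 +0200"  # hypothetical input with an explicit offset
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S %z")

print(parsed.isoformat())  # 2021-10-07T16:07:46+02:00
print(parsed.utcoffset())  # 2:00:00
```

Without %z in the format (and an offset in the string), the result is a naive object, and you are back to guessing.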

Resources

 

