Posts on Lorenzo Peppoloni

What is slots in Python?

Wed, 26 Aug 2020 08:13:50 +0000

tl;dr

Every Python class can have instance attributes that can be dynamically added, removed, or modified. This flexibility increases memory usage and slows down attribute access. If you need to optimize, you can avoid dynamic attribute creation by defining __slots__. Python will then allocate a fixed amount of memory that holds only the specified attributes.

Python classes attributes

In Python every class can have instance attributes. By default, these attributes are stored in a dict. This has the advantage that attributes can be added dynamically to an instance, for example:

class Foo():
    def __init__(self):
        self.a = 10

f = Foo()
f.b = 20

In this case the attribute b is added dynamically to the class instance f.

If we inspect the attributes of the object by using dir() we can see __dict__, which is the dictionary containing the attributes of the instance.

print(f.__dict__)
{'a': 10, 'b': 20}

Note that this cannot be done with instances of built-in types (or of many types implemented in C, such as NumPy arrays), for example:

import numpy

arr = numpy.ones(10)
arr.foo = 10

num = 10
num.foo = 2

one_set = set([12, 13])
one_set.foo = 11

Each of these assignments raises an AttributeError exception.

No dynamic attributes

If we define __slots__ with a list of attributes, we will prevent the dynamic creation of attributes for the class. Let’s modify the first class we created:

class Foo():
    __slots__ = ["a"]

    def __init__(self):
        self.a = 10

f = Foo()
f.b = 20

Now we get AttributeError: 'Foo' object has no attribute 'b'.

If we inspect the attributes of the object, we will see that there is no __dict__ anymore, but we have __slots__ containing the list of attributes, in this case ["a"].
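A minimal sketch of the difference, using two illustrative classes (WithDict and WithSlots are made-up names): the instance of the slotted class has no __dict__ and rejects attributes that are not listed in __slots__.

```python
class WithDict:
    """Regular class: attributes live in a per-instance __dict__."""

    def __init__(self):
        self.a = 10


class WithSlots:
    """Slotted class: only the names listed in __slots__ can be set."""

    __slots__ = ("a",)

    def __init__(self):
        self.a = 10


d = WithDict()
d.b = 20  # allowed: stored in d.__dict__

s = WithSlots()
try:
    s.b = 20  # "b" is not listed in __slots__
    rejected = False
except AttributeError:
    rejected = True

print(hasattr(d, "__dict__"), hasattr(s, "__dict__"), rejected)
```

The attribute a still works as usual on the slotted instance; only the creation of new attribute names is blocked.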

Inheritance

When using inheritance, if the base class defines __slots__, the slots are passed down the inheritance tree, so a subclass only needs to declare its own new attributes in __slots__. If you re-declare the inherited slots in a subclass, Python does not complain, but each instance will use more memory than expected.

class Base:
    __slots__ = ["a", "b"]

class Foo(Base):
    __slots__ = ["c"]  # Correct: Foo already has ["a", "b"] inherited, thus having ["a", "b", "c"]

class Bar(Base):
    __slots__ = ["a", "b", "c"]  # Wrong: no need to re-define ["a", "b"]

By using getsizeof we can see that:

>>> import sys
>>> sys.getsizeof(Foo())
72
>>> sys.getsizeof(Bar())
88

Why use __slots__?

There are two main reasons:

Faster access to attributes

This is the actual reason why __slots__ was introduced. Quoting the History of Python blog:

Some people mistakenly assume that the intended purpose of __slots__ is to increase code safety (by restricting the attribute names). In reality, my ultimate goal was performance.

Less used memory

In general, the default per-instance dict uses a lot of memory, because Python cannot allocate a fixed amount of memory for an instance whose attributes may change at any time. This can take a toll when we create thousands or millions of objects. With __slots__, Python only allocates space for the specified set of attributes.
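As a rough way to see the difference, here is a small sketch comparing two illustrative classes (PointDict and PointSlots are made-up names). sys.getsizeof is shallow, so for the regular class we add the size of the instance __dict__. The exact numbers depend on the Python version and platform, so none are quoted here.

```python
import sys


class PointDict:
    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


p_dict = PointDict(1, 2)
p_slots = PointSlots(1, 2)

# getsizeof does not follow references: count the instance and its __dict__.
size_dict = sys.getsizeof(p_dict) + sys.getsizeof(p_dict.__dict__)
# A slotted instance has no __dict__, so the instance size is all there is.
size_slots = sys.getsizeof(p_slots)

print("with __dict__:", size_dict, "bytes; with __slots__:", size_slots, "bytes")
```

For a more realistic measurement on many objects, a tool such as tracemalloc from the standard library is probably a better fit than getsizeof.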

Is Go good at Math?

Sat, 07 Mar 2020 07:13:50 +0000

We don't usually think of Go as a language for mathematics, geometry or deep learning. Those tasks are usually left to Python.

But is Go good at math as well?

Disclaimer

This post is an effort to share my experience and knowledge about the topic. Are there languages that are a better fit for math? Yes. Is it possible to do math (at least some simple things) with Go? Yes.

Pre-requisite

To implement our code we will use gonum, which is a Go library for numerical and scientific algorithms. As a plus, it has nice plotting functions as well.

Let's have a quick look at gonum/mat, where the linear algebra libraries are implemented.

The first thing to understand about the library is that everything is done using a pointer receiver, for example:

m1 := mat.NewDense(2, 2, []float64{
    4, 0,
    0, 4,
})
m2 := mat.NewDense(2, 3, []float64{
    4, 0, 0,
    0, 0, 4,
})
var prod mat.Dense
prod.Mul(m1, m2)
fc := mat.Formatted(&prod, mat.Prefix("       "), mat.Squeeze())
fmt.Printf("prod = %v\n", fc)

The code will output:

prod = ⎡16  0   0⎤
       ⎣ 0  0  16⎦

As you can see, we defined two matrices, passing the data as a row-major slice of float64, then created a new matrix to hold the product and called the Mul method with the new matrix as the pointer receiver. The final part just prints the result using the built-in formatter.

Let's see how to invert a matrix:

m := mat.NewDense(2, 2, []float64{
    4, 0,
    0, 4,
})
var inv mat.Dense
inv.Inverse(m)
fc := mat.Formatted(&inv, mat.Prefix("      "), mat.Squeeze())
fmt.Printf("inv = %v\n", fc)

The code will output:

inv = ⎡0.25    -0⎤
      ⎣   0  0.25⎦

Again, we defined a matrix, we created an empty matrix to contain the inverse and then we inverted the first matrix.

Solving a linear system

Let's try to solve a linear system:

\[ \begin{matrix} 4x + y + z = 4 \\ y + 5z = -4 \\ 2x + 7y - z = 22 \end{matrix} \]

This can be rewritten as \(Ax = b\)

\[ \begin{bmatrix} 4 & 1 & 1 \\ 0 & 1 & 5 \\ 2 & 7 & -1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} 4 \\ -4 \\ 22 \end{bmatrix} \]

Now the solution is \( x = A^{-1}b\), provided \(A\) is a square matrix with \(\det(A) \neq 0\).

Using gonum, we can either invert \(A\) or use the more generic function SolveVec, which solves a linear system.

A := mat.NewDense(3, 3, []float64{
    4, 1, 1,
    0, 1, 5,
    2, 7, -1,
})
b := mat.NewVecDense(3, []float64{4, -4, 22})
x := mat.NewVecDense(3, nil)

x.SolveVec(A, b)
fmt.Printf("%v\n", x)

Which outputs:

[0.64705 2.76470 -1.35294]
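As a cross-check of these numbers, here is a small Python sketch that solves the same \(A\) and \(b\) with exact rational arithmetic. The helper solve_exact is a made-up name for a plain Gauss-Jordan elimination; it is not what gonum does internally.

```python
from fractions import Fraction


def solve_exact(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions."""
    n = len(A)
    # Augmented matrix [A | b] with rational entries.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Pick a row with a non-zero entry in this column as the pivot.
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Scale the pivot row so that the pivot becomes 1.
        pivot_value = M[col][col]
        M[col] = [v / pivot_value for v in M[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [row[n] for row in M]


A = [[4, 1, 1], [0, 1, 5], [2, 7, -1]]
b = [4, -4, 22]
x = solve_exact(A, b)
print(x, [float(v) for v in x])
```

The exact solution is \((11/17, 47/17, -23/17)\), which agrees with the values printed by gonum up to the displayed digits.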

Neural Network

Now it's time to try something a bit more complex... Let's implement a simple neural network in Go, without going too much into detail on the math (you will have to trust me on that).

Neural Networks ELI5

In a multilayer perceptron, you have an input layer, an output layer and some hidden layers. Each layer, in its simplest form, consists of a linear transformation (\(y_i = W_ix_i + b_i\), for the i-th layer) followed by a nonlinear transformation called the activation function (\(y_i = a_i(W_ix_i + b_i)\)). The network is trained using a cost function (\(L\)), which is the function we are trying to optimize.

For example, we have samples as inputs and outputs and we want our network to learn a function that ties the two. The cost function could be the mean squared error (MSE) between the network output for a given input and the expected output, or the sum of squared errors (SSE) (this is really ELI5). The weights at each layer (\(W_i\)) and the biases (\(b_i\)) are our tunable parameters.

To optimize the cost function we use gradient descent: at each step, we compute the output of the network, then compute the derivative of the cost function with respect to the weights and biases, and update them so that we follow the direction of the negative gradient. In principle, each step moves us closer to the minimum of the cost function.

To monitor the training of our network we will be plotting the values of the loss function.

To simplify the code a bit we will assume that the network has no \(b_i\) terms. The loss is the SSE, \(L = \sum_i (\hat{y}_i - y_i)^2\), and our activation function is the sigmoid function (\(\sigma(x) = \frac{1}{1 + \exp(-x)} \)).
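To make the ingredients concrete, here is a deliberately tiny Python sketch: a single sigmoid unit with one weight and no bias, trained with gradient descent on the SSE. The data, the learning rate and the function names are all made up for illustration; this is not the network from the post, only the same loss, activation and update rule in their smallest form.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sse_loss(w, xs, ys):
    """L = sum_i (y_hat_i - y_i)^2 with y_hat_i = sigmoid(w * x_i)."""
    return sum((sigmoid(w * x) - y) ** 2 for x, y in zip(xs, ys))


def sse_grad(w, xs, ys):
    """dL/dw via the chain rule, using sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))."""
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = sigmoid(w * x)
        total += 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat) * x
    return total


# Made-up toy data: negative inputs map to 0, positive inputs map to 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0]

w = 0.0              # initial weight
learning_rate = 0.5  # arbitrary step size
losses = []
for _ in range(100):
    losses.append(sse_loss(w, xs, ys))
    w -= learning_rate * sse_grad(w, xs, ys)  # step along the negative gradient

print("first loss:", losses[0], "last loss:", losses[-1], "weight:", w)
```

In a real multilayer network the same chain rule is applied layer by layer (backpropagation), with matrices in place of the single weight.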

Let's see if our network can learn the toy problem in this blog post.

After 1,500 iterations, the output generated by the network is [0.014, 0.98, 0.98, 0.024] (the original output of the table is [0, 1, 1, 0]) which means that our simple network was able to overfit and learn the training set.

The full code can be found here.

Below you can see the plot (made with gonum/plot) of the loss during training.

One gotcha to be aware of when doing more complex math with gonum is that chaining multiplications in which the matrix dimensions change will fail the dimension check even if the chain appears correct.

For example:

m1 := mat.NewDense(2, 3, []float64{
    0, 0, 1,
    0, 1, 1,
})

m2 := mat.NewDense(2, 3, []float64{
    1, 0, 1,
    1, 1, 1,
})

m3 := mat.NewDense(2, 3, []float64{
    0, 0, 3,
    0, 1, 1,
})

var mul mat.Dense
mul.Mul(m1, m2.T())
mul.Mul(&mul, m3)

This code will panic despite the multiplication being perfectly valid \((2\times 3)(3\times 2)(2\times 3)\). The failure happens because the auxiliary matrix you are using as the pointer receiver already has a size (\(2\times 2\), from the first product) that cannot contain the result of the new multiplication (\(2\times 3\)).

The solution to this is to create a new auxiliary matrix for each step in which the dimension changes because of the multiplication.

var mul1 mat.Dense
mul1.Mul(m1, m2.T())
var mul2 mat.Dense
mul2.Mul(&mul1, m3)

Plotting

Plotting using gonum/plot is pretty straightforward: you create an object of type Plot and then add plots to it using the plotutil package, which contains routines to simplify adding common plot types, such as line plots, scatter plots, etc.

As an example, to plot the loss:

p, err := plot.New()
// check error.
// ...
err = plotutil.AddLinePoints(p, "", points)
// check error.
err = p.Save(5*vg.Inch, 5*vg.Inch, "loss_history.png")
// ...

Conclusions: In this post, we explored the potential of Go to do math and linear algebra. We had a look at the gonum library, first solving a simple linear system and then implementing a simple neural network in Go. We also had a quick look at how you can use gonum to create plots.

Effective strategies for classification in CT scans

Tue, 03 Mar 2020 07:13:50 +0000

Last October I took part in the RSNA Intracranial Haemorrhage Detection Kaggle challenge. I ended up in the top 10%, which, considering my full-time job and travelling, is a placement I am quite happy with.

The goal of this post is to share some ideas and strategies to work with classification in CT scans.

The task

The task in this competition was a multiclass classification problem: classifying 5 different types of brain haemorrhage (a sixth class was "not present") from computerized tomography (CT) scans of patients' heads. Each type of haemorrhage tends to appear in a different location of the head, with different features. A summary with examples is reported below.

CT scans are usually slices on the axial plane taken at different heights. This means that you can combine consecutive scans to obtain 3D information. In general, scans are provided in the DICOM format, which is an international standard for digital medical images. DICOM files store pixel intensities as raw integer values (they can range, for example, between -32768 and 32767, or less, depending on the number of bits used).

Scans

First, you can convert the scans from pixel intensities to Hounsfield Units (HU). Hounsfield Units describe a linear scale of radiodensity. Basically, values range between -1000 (radiodensity of air) and 1000 (roughly the radiodensity of metal). Denser materials (such as bone or metal) have a higher radiodensity. Lighter materials, like flesh, soft tissue or water, have a lower radiodensity.

To convert from the DICOM values to HU you usually have to look for "slope" and "intercept" in the file metadata. The two values, which are usually provided by the manufacturer, allow you to get the HU:

\[ \text{scan}_{HU} = \text{scan} * \text{slope} + \text{intercept} \]
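As a minimal sketch of this conversion in plain Python (the function name to_hounsfield and the sample values are made up; in DICOM metadata the two parameters commonly appear as the Rescale Slope and Rescale Intercept attributes):

```python
def to_hounsfield(raw_pixels, slope, intercept):
    """Apply scan_HU = scan * slope + intercept to a 2D list of raw values."""
    return [[value * slope + intercept for value in row] for row in raw_pixels]


# Made-up 2x2 "slice" and made-up slope/intercept, just to show the mapping.
raw_slice = [[0, 1024], [1064, 2024]]
hu_slice = to_hounsfield(raw_slice, slope=1.0, intercept=-1024.0)
print(hu_slice)
```

With real scans you would read the pixel array and the two metadata values from the DICOM file and apply the same formula to the whole array at once.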

Now if you try and visualize the images in HU you will probably see something like this

So what's the problem?

The problem is that a normal greyscale image can represent only 256 different shades. Since the HU scale spans roughly 2000 values, each shade of grey covers about 8 HU. As a human, you cannot visually detect changes in shade smaller than about 120 HU (in greyscale). That's why you don't see the nice head scan you were expecting, but just a grey blob.

So what do doctors do?

During scan assessment by a human doctor, what is actually done is that each scan is "focused" on a particular range of the Hounsfield Scale, giving information about a certain type of tissue. Doctors usually focus on 2-3 different windows at the same time (according to the assessment they are performing).

In the case of brain haemorrhages, there are 5 important windows, each one focusing on a type of tissue:

Brain Matter window: W:80 L:40
Blood/subdural window: W:130-300 L:50-100
Soft tissue window: W:350–400 L:20–60
Bone window: W:2800 L:600
Grey-white differentiation window: W:8 L:32 or W:40 L:40

The windows are expressed with two numbers: W, the width of the window, and L, its center (level). Each window focuses on the range:

\[ L - W / 2 < \text{HU} < L + W / 2 \]
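A minimal sketch of applying a window in plain Python, assuming the image is already in HU. The helper names window_bounds and clip_to_window are made up, and the sample values are arbitrary; values outside the window are simply clipped to its bounds.

```python
def window_bounds(level, width):
    """Return the (low, high) HU bounds of a window given L and W."""
    return level - width / 2, level + width / 2


def clip_to_window(hu_image, level, width):
    """Clip every HU value of a 2D list to the window [L - W/2, L + W/2]."""
    low, high = window_bounds(level, width)
    return [[min(max(value, low), high) for value in row] for row in hu_image]


# Made-up HU values, windowed with the brain matter window (W:80 L:40).
hu_slice = [[-1000.0, 0.0, 40.0], [80.0, 300.0, 1000.0]]
brain = clip_to_window(hu_slice, level=40, width=80)
print(window_bounds(40, 80), brain)
```

After clipping, the whole greyscale range is spent on the 80 HU of the window, which is what makes the tissue of interest visible.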

Mimic doctors with ML

A viable approach is to choose 3 different windows and use them as the channels of a 3-channel image. In this way our network will try and learn, as a human doctor does, to classify haemorrhage using multiple windows of the same scan.

This approach works and was successfully used during the competition by lots of participants (including me). The approach is also backed up by several research papers.

Some examples of the images one obtains are shown in the pictures.

The images have been min-max scaled to then be fed to the network.
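The idea can be sketched in plain Python as follows. The three (level, width) pairs are assumptions picked from the list of windows above (the blood/subdural values are one arbitrary choice within the listed ranges), the function names are made up, and each channel is scaled to [0, 1] using the bounds of its own window, which is one possible way of doing the min-max scaling.

```python
# Assumed (name, level, width) choices in HU.
WINDOWS = [
    ("brain", 40, 80),
    ("subdural", 80, 200),
    ("bone", 600, 2800),
]


def window_channel(hu_image, level, width):
    """Clip a 2D HU image to a window and scale the result to [0, 1]."""
    low = level - width / 2
    high = level + width / 2
    return [
        [(min(max(value, low), high) - low) / (high - low) for value in row]
        for row in hu_image
    ]


def to_three_channels(hu_image):
    """Stack the three windowed images channel-last: image[row][col] = (c0, c1, c2)."""
    channels = [window_channel(hu_image, level, width) for _, level, width in WINDOWS]
    return [list(zip(*rows)) for rows in zip(*channels)]


hu_slice = [[-1000.0, 40.0], [70.0, 1500.0]]  # made-up HU values
image = to_three_channels(hu_slice)
print(image[0][1])  # channel values for the pixel at 40 HU
```

The same stacking function can be reused for the volume approach by passing three consecutive slices instead of three windows of one slice.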

Pros

Quite a simple approach.
We are feeding the network the same information a human expert would use (we know it's meaningful).

Cons

We are dropping information that the network might be able to use.

Introduce a volume component

Another approach that was quite successful in the competition was to introduce a volume component instead of, or together with, using multiple windows.

As shown in the figure, scans are consecutive snapshots on the axial plane of the head.

Three consecutive scans can be used as the three channels of an RGB image (still using some windowing on the Hounsfield Scale).

An example of the input images (min-max scaled) is shown in the figure.

Pros

Quite a simple approach.
We are now giving information to the network about volume, although limited.

Cons

We are dropping information that the network might be able to use (for the windowing).

No windowing

An interesting approach developed during the competition was to drop windows altogether and give the network the full range of HU values. The difficulty with this approach is that the distribution of the pixels over the full range is usually strongly bimodal, with values that are not evenly distributed over the whole range. The distribution can change a lot with the type of tissue that is mainly present in each scan.

A solution to this problem is to find (or craft) a nonlinear normalization function to "normalize" our data over the full range, or almost the full range.

An example can be found in the write-up of the 8th place solution to the competition.

Conclusions: We had a look at some possible approaches that work when dealing with classification in CT scans. We started explaining how scans work, how we can convert them to Hounsfield Units and which strategies we can use to feed the data to a neural network.

Do not write misusable APIs

Fri, 28 Feb 2020 20:13:50 +0000

APIs should be easy to use and hard to misuse.

— Josh Bloch

Today I found this quote on Twitter and for a moment I thought: I’m gonna print it and frame it right now!!

This is something I see a lot in my day-to-day life as a Software Engineer. Spending days and days hunting nasty bugs, only to realise that I was misusing an API, taught me a rule to live by:

Make getting your APIs wrong really…really…really hard

There should be no doubt about how to use an API and nothing should be left for the user to guess. Everyone should be able to read your API and understand immediately how to use it.

Let’s write a bad API.

Imagine we are writing a proto message for an API we are implementing. We want to model a shop transaction:

message Transaction {
    int64 timestamp = 1;
    string product_code = 2;
    float price = 3;
    string address = 4;
}

This message is not the most usable. You look at it and questions arise:

mmmm int64 timestamp, wait should I put there an epoch timestamp?
price should it be in the local currency?
address of what? presumably of the shop…maybe?

In this message, for at least 3 fields, it is not immediately and unmistakably clear how to use them.

Let’s try and improve our interface. We can achieve this in multiple ways.

Good documentation

That’s a solution you see quite often and it’s a good solution. If your API is well documented, people will know how to use it (presumably).

// Transaction models a shop transaction. Each transaction involves only one product.
// Each product can be purchased in a shop.
message Transaction {
    // Unix epoch with nanoseconds indicating when the transaction was completed.
    int64 timestamp = 1;
    // Unique product code of the product involved in the transaction.
    string product_code = 2;
    // Price in the local currency of the product involved in the transaction.
    // If the address is not specified, then the currency must be in USD.
    float price = 3;
    // Address of the shop where the transaction happened.
    string address = 4;
}

Ok good, now as a user, I have way more information. I know that a transaction is a shop transaction, that the timestamp is an epoch timestamp with nanoseconds, that the price is in the local currency of the store and that the address is the address of the shop.

There is still something bugging me, right? The price can be either in local currency if the address is specified, or in USD if no address is provided. Mmmmm, despite the comments that document the behaviour, that doesn’t sound quite right.

Imagine you are a new hire and you don’t know anything about this message. You have to return all the transactions that happened in the US. You notice that some of the transactions are missing the address, but hey, all of them have the price. Let’s just fetch all the transactions in USD. That would return the wrong set of transactions…this could lead to bugs that are quite hard to find.

The problem here is one of cognitive load. The API still doesn’t fully document itself: you need some previous knowledge to use it (in this case, that when the address is missing the price is in USD).

As an additional point, comments will not be available in the classes generated for this message. A developer will always have to dig up the proto definition and read the comments.

Good naming

If the names of the fields speak for themselves and make themselves unmistakable, well…then it’s really hard to get it wrong.

// ShopTransaction defines a transaction that happened in a shop.
message ShopTransaction {
    int64 epoch_timestamp_ns = 1;
    string purchased_product_code = 2;
    // Price of the transaction, always in USD regardless of the location
    // where the transaction happened.
    float price_usd = 3;
    string shop_address = 4;
}

As you can clearly see, at this point comments are basically superfluous. Each field states exhaustively what it contains and what should be put in it. The price is always in USD, so no assumption about the purchase location can be made from the price; the address is the address of the shop; the timestamp is in epoch nanoseconds. All of this is clear from the names.

Good. We don’t need many comments at this point, but we are still free to add them. We can still drop a line saying that the price is always in USD, no matter the address, but it’s not strictly necessary.

As a bonus, to feel fully happy about the message, I would probably change the int64 to a Google Timestamp (google.protobuf.Timestamp) and maybe deal a bit better with the address field. Defining an address message might be a good idea, but it really depends on the use case.
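The same idea can be sketched outside of protobuf. Below is a hypothetical Python analogue of the message, where a timezone-aware datetime plays the role of the Timestamp and a small ShopAddress type plays the role of the address message. All class and field names here are invented for illustration and are not generated from the proto definition.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ShopAddress:
    street: str
    city: str
    country_code: str


@dataclass(frozen=True)
class ShopTransaction:
    completed_at_utc: datetime
    purchased_product_code: str
    price_usd: float
    shop_address: ShopAddress

    def __post_init__(self):
        # Refuse ambiguous input instead of leaving the caller to guess.
        if self.completed_at_utc.tzinfo is None:
            raise ValueError("completed_at_utc must be timezone-aware")
        if self.price_usd < 0:
            raise ValueError("price_usd must be non-negative")


shop = ShopAddress(street="1 Example Street", city="Springfield", country_code="US")
transaction = ShopTransaction(
    completed_at_utc=datetime(2020, 2, 28, 20, 13, 50, tzinfo=timezone.utc),
    purchased_product_code="SKU-0001",
    price_usd=9.99,
    shop_address=shop,
)
print(transaction)
```

The types now carry part of the documentation: a timestamp without a timezone is rejected at construction time, so that particular misuse cannot go unnoticed.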

Re-identification with Triplet Loss

Wed, 26 Feb 2020 07:13:50 +0000

One very interesting computer vision problem is re-identification. The idea is that you have images of some entity and you want to be able to re-identify that entity in new images. As a complementary problem, you might also want to be able to say if an identity is known or not.

Classic use cases are people re-identification for surveillance, but there are also fancier use cases such as whale re-identification for monitoring and conservation efforts.

A classic way of solving the re-identification problem with Deep Learning is to train a CNN to learn an embedding space where different observations of the same entity are mapped close together or, better, closer than observations of a different entity.

Formally this approach, called learning metric embeddings, has the goal of learning a function that takes images in a space \(R^{F}\) to a space \(R^{D}\) where semantically similar points in the initial space are mapped to metrically close points. At the same time, semantically different points in the original space are mapped to metrically distant points.

What we want to learn is the function

\[\textit{f}_\theta(x): R^{F} \rightarrow R^{D}\]

The function is usually parametric and can be anything from a linear transform to complex non-linear maps.

A way to tackle the problem is to train a neural network to learn that function. In this case, we can use one of the final layers of the network as the embedding space; we just have to come up with a loss function.

A typical approach at this point is to use a loss function that pulls points belonging to the same entity close together while pushing points belonging to different entities far apart.

Let's define a metric \(D_{x, y}: R^D \times R^D \rightarrow R\) that measures a distance between the points \(x\) and \(y\) in \(R^D\).

In [1] the authors proposed a loss function called Triplet Loss. The function is called triplet because it computes the loss over a triplet of points:

the anchor \(x_a\), which is a sample of one entity
the positive sample \(x_p\), which is another sample of the same entity used as anchor
the negative sample \(x_n\), which is a sample of a different entity.

The function mathematically is:

\[ L = \sum\limits_{a,p,n}[m + D_{a,p} - D_{a,n}]_+\]

where \([\bullet]_+\) is the hinge function \(\max(0, \bullet)\).

It is pretty straightforward to see that the loss pushes the distance \(D_{a,p}\) between the anchor and the positive sample to be smaller than the distance \(D_{a,n}\) between the anchor and the negative sample by at least a margin \(m\).

Usually, the Euclidean distance is used as the metric \(D\).

A modification can be made to the Triplet Loss to introduce what is called a soft margin. In this case, the hinge function is modified to be

\[\text{softplus}(x) = \log(1+e^x)\]

This yields mainly two advantages:

we remove one hyperparameter (\(m\))
the softplus function decays exponentially instead of having a hard cut-off like the hinge function. This means that triplets that already satisfy the margin \(m\) will still contribute a bit to the loss, with the effect of still pulling same-entity samples as close as possible and pushing different-entity samples as far as possible.
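Both variants can be sketched for a single triplet in plain Python, using the Euclidean distance as \(D\). The function names and the default margin of 0.2 are arbitrary choices for illustration; a real training loop would compute this on batches of embeddings inside a deep learning framework.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two embeddings given as lists of floats."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge version: max(0, m + D(a, p) - D(a, n))."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, margin + d_ap - d_an)


def soft_margin_triplet_loss(anchor, positive, negative):
    """Soft-margin version: softplus(D(a, p) - D(a, n)), with no margin parameter."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return math.log1p(math.exp(d_ap - d_an))


anchor = [0.0, 0.0]
near = [0.0, 1.0]  # distance 1 from the anchor
far = [3.0, 4.0]   # distance 5 from the anchor

easy_hinge = triplet_loss(anchor, positive=near, negative=far)
hard_hinge = triplet_loss(anchor, positive=far, negative=near)
easy_soft = soft_margin_triplet_loss(anchor, positive=near, negative=far)
print(easy_hinge, hard_hinge, easy_soft)
```

For the easy triplet the hinge loss is exactly zero, while the soft-margin loss is small but still positive, which is the behaviour described in the second point above.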

Ok so let's give this a try in a real re-identification case.

A real-life re-identification problem

Let's use as a test case the whale identification task from last year's Humpback Whale Identification Kaggle competition. The task for the competition was to train a model able to identify a whale by its fluke (which is unique for each whale, kind of like a fingerprint). This is a nice real-life case: the dataset is unbalanced and noisy, and there are lots of nuances:

it's not easy to take consistent pictures of moving flukes, so you will have a wide variety of viewpoints and occlusions (mainly water splashes)
flukes can slightly change in time due to injuries

Just for reference, that's what the images look like.

The full code for our experiment can be found here.

To simplify the problem, let's use a smaller dataset consisting of only the 10 whales with the highest number of occurrences. The histogram of the sample count for this smaller toy dataset is shown below.

For the task, we will use a pre-trained Resnet34 as the main feature extractor and we will add a final linear layer with \(D=128\), which will be the dimension of our metric space.

Let's see how the embeddings evolve in 2D during training; each colour represents a different whale.

How do we evaluate our network now?

Since we used the Euclidean distance, one solution is to compute the embeddings for the validation set, find for each of them the nearest embeddings of the training set, and use that information to infer the entities in the validation set. For the sake of this example, I just computed classification accuracy, assigning to each validation sample the label of the closest training sample.
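A minimal sketch of this evaluation in plain Python, with made-up 2D embeddings and labels standing in for the real network outputs (the function names are illustrative too):

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest_label(query, train_embeddings, train_labels):
    """Label of the training embedding closest to the query embedding."""
    best = min(
        range(len(train_embeddings)),
        key=lambda i: euclidean(query, train_embeddings[i]),
    )
    return train_labels[best]


def nn_accuracy(val_embeddings, val_labels, train_embeddings, train_labels):
    """Fraction of validation samples whose nearest training sample has the same label."""
    correct = sum(
        nearest_label(emb, train_embeddings, train_labels) == label
        for emb, label in zip(val_embeddings, val_labels)
    )
    return correct / len(val_labels)


# Made-up embeddings: two whales, two training samples each.
train_embeddings = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]]
train_labels = ["whale_a", "whale_a", "whale_b", "whale_b"]
val_embeddings = [[0.2, 0.1], [4.8, 5.1], [2.0, 2.0]]
val_labels = ["whale_a", "whale_b", "whale_b"]

accuracy = nn_accuracy(val_embeddings, val_labels, train_embeddings, train_labels)
print(accuracy)
```

In this toy data the third validation sample sits closer to the whale_a cluster, so it is misclassified and the accuracy is 2 out of 3.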

I used the accuracy as the monitor variable for early stopping. After 55 epochs we got an accuracy of 0.93.

Some interesting variables to monitor while training for metric learning using the Triplet Loss are the norms of the embeddings and the distances between embeddings. Let's have a look at the median and the p95 of those quantities as they evolve for each mini-batch.

As you can see, as the training proceeds, the embeddings are pushed to become larger and larger and more and more distant from each other. These plots are also really informative to decide when to stop the training (more on this later).

Can we do better?

If you think about how we trained the network, we picked anchor samples at random and, for each one of them, randomly selected positives and negatives. What usually happens is that the network quickly learns the easy triplets, which then become uninformative during the training process. A solution to this would be to present all the possible combinations to the network during the training process, but that can become impractical as the number of samples grows.

The problem can be solved by "mining" for hard triplets. What's a hard triplet?

A triplet can be defined as hard when \(D_{a, p} > D_{a, n}\), that is, when the negative is closer to the anchor than the positive. Those are the triplets that need the biggest correction.

We have two ways of mining triplets, offline and online.

Offline triplet mining

We compute all the embeddings at the beginning of each epoch and then we look for hard triplets (or semi-hard triplets, where \(D_{a, p} < D_{a, n} < D_{a, p} + m\)). We can then train one epoch on the mined triplets.

Mining offline is not very efficient: we need to compute all the embeddings and update the triplets often to keep our network seeing hard examples.

Online triplet mining

In online mining, we compute the hard triplets on the fly. The idea is that for each batch we compute \(B\) embeddings (where \(B\) is the batch size) and then use some smart strategy to create triplets from these \(B\) embeddings.

An approach called batch hard was proposed in [2], where for each anchor you select the hardest positive and the hardest negative within the batch:

Select for each batch \(P\) entities and \(K\) images for each entity (usually \(B\leq PK \leq 3B\)).
For all the anchors find the hardest positive (biggest \(D_{a,p}\)) and the hardest negative (smallest \(D_{a, n}\))
Train the epoch on the mined hardest triplets.
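The selection step can be sketched in plain Python as follows. The function name batch_hard_triplets and the toy 1D embeddings are made up for illustration; a real implementation would work on a distance matrix computed on the GPU for each batch.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def batch_hard_triplets(embeddings, labels):
    """For each anchor, pick the farthest positive and the closest negative in the batch.

    Returns a list of (anchor_index, positive_index, negative_index) tuples.
    Anchors without any positive or any negative in the batch are skipped.
    """
    triplets = []
    for a, (emb_a, label_a) in enumerate(zip(embeddings, labels)):
        positives = [i for i, lab in enumerate(labels) if lab == label_a and i != a]
        negatives = [i for i, lab in enumerate(labels) if lab != label_a]
        if not positives or not negatives:
            continue
        hardest_positive = max(positives, key=lambda i: euclidean(emb_a, embeddings[i]))
        hardest_negative = min(negatives, key=lambda i: euclidean(emb_a, embeddings[i]))
        triplets.append((a, hardest_positive, hardest_negative))
    return triplets


# Toy batch with P = 2 entities and K = 3 samples each, using 1D embeddings.
embeddings = [[0.0], [1.0], [3.0], [4.0], [5.0], [10.0]]
labels = [0, 0, 0, 1, 1, 1]
triplets = batch_hard_triplets(embeddings, labels)
print(triplets)
```

The loss is then computed only on these mined triplets, one per anchor in the batch.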

As a note on the size of \(P\) and \(K\): \(3B\) is the number of embeddings we would have to compute while mining offline, since to get \(B\) unique triplets you need \(3B\) embeddings.

There are lots of practical considerations to be made with this approach, for example:

Is the dataset clean? Are the hardest triplets impossible triplets that are just confusing the network?
In some cases you might not have \(K\) samples for each instance (few-shot learning), or you might have only 1 (one-shot learning). In this case, augmentation might be your friend. If you can heavily augment the samples you could use the same images to reach \(K\).
Overall, it might be a good idea to do a first round of training without mining to bootstrap the network and then later switch to hard triplets mining.

Still, each use case is different, so the best thing to do is to experiment.

Ok, let's retrain using online batch hard mining and let's see how our network behaves.

After 47 epochs, our training stopped reaching 0.95 accuracy.

This is the embeddings evolution during training.

Let's have a look again at the evolution of norms and distances of the embeddings.

In this case, it is even more relevant to have a look at the distance/norm plots to decide when to stop training. What can happen is that the loss may appear to stagnate, since as soon as the network has learnt the hard cases, new ones will be presented. For example, looking at the graph, we could probably have trained the model for longer.

Another useful number to check to see how training is going is the number of active triplets, that is, the number of triplets with non-zero loss.

Conclusions: We had an in-depth look at how to solve the re-identification problem using Deep Learning. We understood the triplet loss and how it can be improved using triplet mining. We had a look at a real-life re-identification example and solved it with the concepts we learned.

[1] FaceNet: A Unified Embedding for Face Recognition and Clustering

[2] In Defense of the Triplet Loss for Person Re-Identification

Everything you need to know about multi-object tracking

Fri, 21 Feb 2020 07:13:50 +0000

I find multiple object tracking (MOT) a very interesting problem. In the case called tracking-by-detection, you have a bunch of detections of objects (they can be either in 2D or 3D) and you have to associate detections over time, figuring out whether they are observations of the same object.

More formally, we can define the problem as a multi-variable estimation problem.

Given a set of frames, we have a set of states of objects in each frame. Let's call \(s_j^{i}\) the state of the object \(i\) in frame \(j\), all the \(M_j\) objects in the \(j\)-th frame are the set \(S_j = \{s^{1}_j, s^{2}_j, ..., s_j^{M_j}\}\). The set of the states \(S_{1:t} = \{S_1, S_2, S_3, ..., S_t\}\), defines all the states for all the objects in the frame sequence.

Now we have a set of observations for each frame \(O_{1:t} = \{O_1, O_2, ..., O_t\}\), where \(O_j = \{o^{1}_{j}, o^{2}_{j}, ... o^{M_j}_{j} \}\) are all the observations for frame \(j\). Note that for the sake of the notation we are assuming that we have exactly one observation for each and every object: \(M_j\) states and \(M_j\) observations at frame \(j\).

Now the problem that we want to solve is to find the "optimal" sequence of states given the observations. This can be solved as a maximum a posteriori estimation (MAP) problem

\[\hat{S}_{1:t} = \text{argmax}_{{S_{1:t}}} P(S_{1:t}| O_{1:t})\]

Usually, this can be solved either with a probabilistic approach or with an optimization approach. The former usually works online (more on this later), while the latter is usually better suited for offline tracking, since you want to find the global optimum over the whole frame sequence. The optimization approach is also known as non-causal since you are using future and past observations at the same time.

Probabilistic approach

Usually, to solve the problem with a probabilistic approach you can adopt a two-step iterative process:

you predict the state at the next step without using the observations (predict)
you correct your prediction with the observations (update).

To perform the predict step you need some dynamic model that you can use to compute predictions. To perform the update step you need some measurement/observation model that ties the observations back to the state so that you can perform the correction.

More formally:

\[ \textit{Predict}: P(S_t|O_{1:t-1}) = \int P(S_t|S_{t-1})P(S_{t-1}|O_{1:t-1})dS_{t-1}\]

\[ \textit{Update}: P(S_t|O_{1:t}) \propto P(O_t|S_t)P(S_t|O_{1:t-1})\]

Where \(P(S_t|S_{t-1})\) is the dynamic model that tells us how the states are supposed to evolve in time, and \(P(O_t|S_{t})\) is the measurement model.
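When the state can only take a finite number of values, the integral in the predict step becomes a sum and the two steps can be sketched in a few lines of Python. This is a toy discrete Bayes filter with a made-up transition matrix and likelihood, only meant to mirror the two formulas; it is not a tracker.

```python
def predict(belief, transition):
    """Predict step: P(S_t | O_{1:t-1}) = sum over s' of P(S_t | s') * P(s' | O_{1:t-1}).

    transition[prev][cur] is the dynamic model P(S_t = cur | S_{t-1} = prev).
    """
    n = len(belief)
    return [
        sum(transition[prev][cur] * belief[prev] for prev in range(n))
        for cur in range(n)
    ]


def update(predicted, likelihood):
    """Update step: P(S_t | O_{1:t}) is proportional to P(O_t | S_t) * P(S_t | O_{1:t-1})."""
    unnormalized = [lik * prior for lik, prior in zip(likelihood, predicted)]
    total = sum(unnormalized)
    return [value / total for value in unnormalized]


# Three possible positions; the object starts at position 0 with certainty.
belief = [1.0, 0.0, 0.0]
# Made-up dynamic model: stay with probability 0.5, move one step right with 0.5.
transition = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
]
# Made-up measurement model for one observation that mostly supports position 1.
likelihood = [0.1, 0.9, 0.1]

predicted = predict(belief, transition)
posterior = update(predicted, likelihood)
print(predicted, posterior)
```

The normalization in update is what the proportionality sign in the formula hides. With continuous states and Gaussian models the same two steps have closed-form solutions.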

Note that to be able to formulate this solution to the problem, we are assuming that the Markov assumption holds (past and future are independent given the current state).

Pros

Works online.
Can be less heavy computationally.

Cons

Might not provide a global optimum, since we are not using the whole sequence.

Optimization approach

A second approach is to solve the estimation problem via optimization, either maximizing a likelihood or minimizing an energy function.

More formally

\[ \hat{S}_{1:t} = \text{argmax}_{S_{1:t}} P(S_{1:t}| O_{1:t}) = \text{argmax}_{S_{1:t}} L(O_{1:t} | S_{1:t})\]

or considering an energy function, which is minimized rather than maximized

\[ \hat{S}_{1:t} = \text{argmax}_{S_{1:t}} P(S_{1:t}| O_{1:t}) = \text{argmin}_{S_{1:t}} E(S_{1:t} | O_{1:t})\]

Note that models and in general knowledge about the expected behaviour of the objects can be injected also in the optimization approach. One very used approach is to enforce motion constraints through the function E.
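As a toy illustration of a motion constraint expressed as an energy, the sketch below labels a whole sequence at once by exhaustive search. The energy (sum of squared frame-to-frame displacements) and the detections are assumptions chosen for the example; real trackers use richer energies and far more efficient solvers.

```python
from itertools import permutations, product


def energy(tracks):
    """Sum of squared frame-to-frame displacements over all tracks (lower is smoother)."""
    total = 0.0
    for track in tracks:
        for (x0, y0), (x1, y1) in zip(track, track[1:]):
            total += (x1 - x0) ** 2 + (y1 - y0) ** 2
    return total


def best_assignment(frames):
    """Try every labelling of the whole sequence and keep the one with minimum energy."""
    n_objects = len(frames[0])
    best_tracks, best_energy = None, float("inf")
    # One permutation per frame after the first; the first frame fixes the identities.
    options = [list(permutations(range(n_objects))) for _ in frames[1:]]
    for choice in product(*options):
        orders = [tuple(range(n_objects))] + list(choice)
        tracks = [
            [frames[t][orders[t][obj]] for t in range(len(frames))]
            for obj in range(n_objects)
        ]
        e = energy(tracks)
        if e < best_energy:
            best_tracks, best_energy = tracks, e
    return best_tracks, best_energy


# Hypothetical detections (x, y) of two objects over three frames, in arbitrary order.
frames = [
    [(0.0, 0.0), (10.0, 0.0)],
    [(10.0, 1.0), (1.0, 0.0)],
    [(2.0, 0.0), (10.0, 2.0)],
]
tracks, value = best_assignment(frames)
```

Because the search looks at all frames at once, this is an offline (non-causal) method, and its cost grows very quickly with the number of objects and frames.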

Pros

Can find a global optimum, since the whole sequence is used.

Cons

"Heavier" computationally.
Works offline (you are using the future).

The Models

Let's talk a bit about the models which I find to be a very interesting aspect of MOT.

You have two problems to solve

how to measure the similarity between objects across frames
how to use that similarity information to recover identity across frames.

Roughly speaking, the first problem usually involves modelling the appearance or the motion of an object, while the second is the inference problem. Appearance is used here as a generic term; it is the visual appearance if you are using a camera.

Two widely used approaches for modelling in MOT are appearance models and motion models. The former uses how an object appears to the sensor, the latter uses the expected motion of the object.

Let's have a look at some examples. One of the simplest motion models consists of assuming that from one frame to the next an object didn't move much. If I have an observation in frame \(j\) and a "close" observation in frame \(j+1\), I will associate them to the same object. What does close mean? I can, for example, measure the distance between the centroids, or I can use intersection over union: if the overlap between the two boxes, divided by the area of their union, is above a certain threshold, they are matched in time.
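The two notions of "close" can be sketched in a few lines. The box format (x_min, y_min, x_max, y_max) and the 0.3 threshold are assumptions made for this example.

```python
def centroid_distance(box_a, box_b):
    """Euclidean distance between box centres; boxes are (x_min, y_min, x_max, y_max)."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def same_object(box_prev, box_next, iou_threshold=0.3):
    """Match two boxes from consecutive frames if they overlap enough."""
    return iou(box_prev, box_next) >= iou_threshold
```

For example, `same_object((0, 0, 2, 2), (1, 0, 3, 2))` matches the two boxes, since they share one third of their union.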

This is a pretty simple approach that works. The main problems come from occlusions and from the assumption (which might not hold) that the rate at which frames are captured is "high" enough to capture very small motions in the observations. In the case of occlusions, you will likely experience id switches, since boxes of different objects will overlap for some frames.

Let's see how a centroid tracker behaves for example. These results are obtained using the Oxford Towncentre Database for pedestrian tracking.

Detections are already available to be used for tracking.

As you can see the tracker works, but there are cases where id switches do happen, especially when the scene gets more crowded.

Now, if we want to make the tracker more robust we could either use an appearance model and use information about how the detected object looks or use a motion model and make assumptions about the motion of the detected objects (e.g., in the case of a pedestrian we can assume that the object will move with constant velocity).

Appearance models

Appearance models include two components:

a representation of the object appearance
a measurement of the distance between such two representations

In the case of visual tracking, lots of different representations can be used, such as local features (or deep features) of the image, colour histogram, HOG, etc...

In general, gradient-based features like HOG can describe the shape of an object and are robust to lighting changes, but they cannot handle occlusion and deformation well. Region covariance matrix features are more robust as they take more information into account, but they are more computationally expensive.

The distance between two representations can be computed in several ways, mainly depending on the appearance model used.
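As one possible pairing of representation and distance, the sketch below uses a normalised grey-level histogram and the Hellinger distance between histograms. Both choices, the bin count and the pixel values are assumptions for illustration; they are not the models used in the demos.

```python
import math


def histogram(pixels, bins=4, max_value=256):
    """Normalised histogram of integer grey levels in [0, max_value)."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    total = len(pixels)
    return [c / total for c in counts]


def hellinger_distance(h1, h2):
    """0 for identical normalised histograms, 1 for histograms that do not overlap."""
    overlap = sum(math.sqrt(a * b) for a, b in zip(h1, h2))
    return math.sqrt(max(0.0, 1.0 - overlap))


# Hypothetical image patches, flattened to lists of grey levels.
dark_patch = histogram([10, 20, 30, 40])
bright_patch = histogram([200, 210, 220, 250])
mixed_patch = histogram([10, 20, 220, 250])
```

A dark patch is at distance 0 from itself, at distance 1 from the bright patch, and somewhere in between from the mixed one.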

Let's see how our tracker improves using an appearance model.

As you can see the tracker is more robust to occlusion. This is because we are using information about the appearance of the tracked objects, so we don't "confuse" them with a different occluding object.

Motion Models

As a final topic let's have a look at motion models. Motion models assume knowledge about how an object moves and predict the expected position of the object. The predicted position is later corrected and updated with the measurement, which is now matched to the predictions.

A very common way to use motion models in the probabilistic iterative approach is to use Kalman Filters. A very common assumption is that the objects move with constant velocity or constant acceleration.
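Here is a minimal sketch of a constant-velocity Kalman filter for a single coordinate, assuming NumPy is available. The state is (position, velocity); the noise levels, the initial covariance and the measurements are made-up values for illustration.

```python
import numpy as np


def kf_predict(x, P, F, Q):
    """Predict step: propagate the state and its covariance with the motion model."""
    return F @ x, F @ P @ F.T + Q


def kf_update(x, P, z, H, R):
    """Update step: correct the prediction with the measurement z."""
    innovation = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x + K @ innovation
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new


dt = 1.0
F = np.array([[1.0, dt], [0.0, 1.0]])  # constant velocity: position += velocity * dt
H = np.array([[1.0, 0.0]])             # we only measure the position
Q = 1e-3 * np.eye(2)                   # process noise (assumed)
R = np.array([[0.25]])                 # measurement noise (assumed)

x = np.array([0.0, 0.0])               # initial state: position 0, velocity 0
P = 10.0 * np.eye(2)                   # large initial uncertainty

# Hypothetical noisy positions of an object moving about one unit per frame.
for z in [1.0, 2.1, 2.9, 4.0, 5.1]:
    x, P = kf_predict(x, P, F, Q)
    x, P = kf_update(x, P, np.array([z]), H, R)

# During an occlusion there is no measurement: we just keep predicting.
x_pred, _ = kf_predict(x, P, F, Q)
```

The last line is what helps with occlusions: when no detection is available, the track can still be moved forward by the motion model alone.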

Let's have a look again at how using motion models improves our tracker.

Here we are using a Kalman Filter with a constant velocity model. The tracker is still robust to occlusion since we are predicting the future position of each object using the motion model.

The models solve the problem of how to measure similarity. The second problem, using the similarity to recover identity, can be solved in several different ways. In the demo cases presented here, it was solved by optimizing the intersection over union between the tracks (after the update) and the observations.
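One simple way to turn IoU scores into identities is a greedy matching, sketched below. This is a simplification and not necessarily what the demos use: a greedy pass does not guarantee the best overall assignment, which needs a proper assignment solver such as the Hungarian algorithm. Track ids, boxes and the threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_threshold=0.3):
    """Greedily pair track boxes with detection boxes, highest IoU first.

    tracks: dict mapping an integer track id to its (predicted) box.
    detections: list of boxes for the current frame.
    Returns (matches, unmatched) where matches maps track id to detection index.
    """
    candidates = sorted(
        (
            (iou(t_box, d_box), t_id, d_idx)
            for t_id, t_box in tracks.items()
            for d_idx, d_box in enumerate(detections)
        ),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for score, t_id, d_idx in candidates:
        if score < iou_threshold:
            break  # all remaining pairs overlap even less
        if t_id in used_tracks or d_idx in used_dets:
            continue
        matches[t_id] = d_idx
        used_tracks.add(t_id)
        used_dets.add(d_idx)
    unmatched = [i for i in range(len(detections)) if i not in used_dets]
    return matches, unmatched


tracks = {1: (0, 0, 10, 10), 2: (20, 0, 30, 10)}
detections = [(21, 0, 31, 10), (1, 0, 11, 10), (50, 50, 60, 60)]
matches, unmatched = associate(tracks, detections)
```

Unmatched detections are the natural candidates for starting new tracks.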

The examples were created using modified versions of tracking code from this repository.

Conclusions: We had an in-depth look at the multi-object tracking problem and at how it can be formalized and solved. We had a look at some classic ways of solving it and we also had a look at the real-life example of pedestrian tracking.

The eight-points algorithm

Wed, 19 Feb 2020 07:13:50 +0000

In a previous blog post we had a look at how to estimate the optical flow (i.e., track how pixels move in time) in a set of images. The estimation we obtained gave us pixel matches across the image set.

Given the correspondences between two images, we can estimate the motion and the 3D position of the points we are observing. Solving this problem is known as Structure from Motion (SfM).

Let's assume we are observing a 3D point \(X\) in two different images. The two viewpoints are related by a rigid transformation (rotation plus translation) given by the matrix \(R\) for the rotation and by the vector \(T\) for the translation. If we draw an imaginary line from each of the camera centres \(c_1\) and \(c_2\) to the 3D point, the point is projected to \(x_1\) (in image space) in the first image and to \(x_2\) (in image space) in the second image.

In each camera's own frame, we can also write that \(\lambda_1 x_1 = X\) and that \(\lambda_2 x_2 = X\), with \(\lambda_1\) and \(\lambda_2\) being the scaling factors to go from the points in the image to the point \(X\).

So, let's say that we know \(x_1\) and \(x_2\) (one of the matches we found between the two images), how can we recover \(R\), \(T\) and \(X\)?

Let's try and rewrite everything in the second camera frame.

\[ \lambda_2x_2 = R\lambda_1 x_1 + T\]

We can multiply by the skew-symmetric matrix of \(T\), written \(\hat{T}\), which removes the translation term since \(\hat{T}T = T \times T = 0\)

\[ \lambda_2 \hat{T}x_2 = \lambda_1 \hat{T}Rx_1 \]

we can then left-multiply by \( x_2^T \) and divide by \(\lambda_1\)

\[ x_2^{T}\hat{T}Rx_1 = 0 \]

Note: the term on the left goes to zero because \( T \times x_2\) is orthogonal to \(x_2\), so if you compute the scalar product with \( x_2 \) you get zero.

Now we have an expression that couples the camera motion and the two known 2D locations. This equation is called the epipolar constraint. Note that the 3D point \(X\) does not appear in the equation: we successfully decoupled the problem of computing \(R\) and \(T\) from the problem of computing the 3D coordinates of \(X\).
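A quick numerical check of the constraint, assuming NumPy is available. The rotation, the translation and the 3D point are arbitrary values picked for the example.

```python
import numpy as np


def skew(t):
    """Skew-symmetric matrix of t, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([
        [0.0, -t[2], t[1]],
        [t[2], 0.0, -t[0]],
        [-t[1], t[0], 0.0],
    ])


# Hypothetical motion: a small rotation about the y axis plus a translation.
angle = 0.1
R = np.array([
    [np.cos(angle), 0.0, np.sin(angle)],
    [0.0, 1.0, 0.0],
    [-np.sin(angle), 0.0, np.cos(angle)],
])
T = np.array([-1.5, 0.0, 0.2])

X1 = np.array([0.3, -0.2, 4.0])  # the 3D point in the first camera frame
X2 = R @ X1 + T                  # the same point in the second camera frame

x1 = X1 / X1[2]                  # normalised image coordinates, lambda_1 = X1[2]
x2 = X2 / X2[2]                  # normalised image coordinates, lambda_2 = X2[2]

residual = x2 @ skew(T) @ R @ x1  # zero up to floating-point error
```

Only the two image points and the motion enter the last line; the depth of the point never appears.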

Geometrically, the epipolar constraint says something pretty straightforward. If you look at the first picture: the parallelepiped spanned by the vectors \(x_2\), \(T\) (\(\vec{c_2c_1}\)) and \(Rx_1\) (which is \(\vec{c_1x_1}\) seen from the second camera) has zero volume, thus the triangle \((c_1c_2X)\) lies on a plane.

The epipolar constraint can be rewritten as:

\[ x_2^{T}Ex_1 = 0 \]

\(E\) is called the essential matrix, and it has the following property:

\[ E = U \text{diag}(\sigma, \sigma, 0) V^{T} \]

That is, the essential matrix has three singular values: two are equal and one is zero.

\(R\) and \(T\) can be extracted from the essential matrix. Usually what we do in practice is that we find a matrix \(F\) that solves the epipolar constraint and then we compute the "closest" essential matrix (projecting \(F\) to the space of the essential matrices).

The eight-points algorithm

To solve the equation in \(E\) we need to re-write it in such a way to separate known variables (\(x_1\) and \(x_2\)) from the unknown \(E\).

If we stack the columns of \(E\) in a single vector \(E^{s}\) and we call \(a = x_1 \otimes x_2\) the Kronecker product of \(x_1\) and \(x_2\), we can write

\[ x_2^TEx_1 = a^{T}E^{s} = 0 \]

Now, we can stack this equation for all the matches we have between the two images and obtain the following linear system which contains all the epipolar constraints for all the points

\[ \chi E^{s} = 0 \quad \text{with } \chi = (a^{1}, a^{2}, ..., a^{n})^{T} \]

You can immediately see that the solution to the system is not unique and that any scaling factor multiplying \(E^{s}\) will still solve the equation. In practice, this means that we are not able to compute the length of the baseline, that is the translation between the two cameras, but only its direction. The solution is to consider the baseline equal to one and compute everything in "baseline units".

To have a unique solution (up to this scale) we need at least 8 points, which is what gives the algorithm its name.
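The linear system can be sketched end to end on synthetic data, assuming NumPy is available. The camera motion and the eight random points are made up for the example, and `build_chi` and `estimate_essential` are hypothetical helper names. The final projection step uses the SVD with \(\sigma = 1\) to fix the scale.

```python
import numpy as np


def skew(t):
    """Skew-symmetric matrix of t, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]], [t[2], 0.0, -t[0]], [-t[1], t[0], 0.0]])


def build_chi(points_1, points_2):
    """One row a^T = (x_1 kron x_2)^T per match; points are homogeneous (n, 3) arrays."""
    return np.array([np.kron(x1, x2) for x1, x2 in zip(points_1, points_2)])


def estimate_essential(points_1, points_2):
    chi = build_chi(points_1, points_2)
    # The right singular vector of the smallest singular value spans the null space.
    _, _, vt = np.linalg.svd(chi)
    F = vt[-1].reshape(3, 3).T  # undo the column stacking
    # Project onto the space of essential matrices, fixing the scale with sigma = 1.
    U, _, Vt = np.linalg.svd(F)
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt


# Hypothetical world: eight random points in front of both cameras.
rng = np.random.default_rng(0)
angle = 0.3
R_true = np.array([
    [np.cos(angle), 0.0, np.sin(angle)],
    [0.0, 1.0, 0.0],
    [-np.sin(angle), 0.0, np.cos(angle)],
])
T_true = np.array([-1.5, 0.0, 0.0])
X1 = np.column_stack([
    rng.uniform(-1, 1, 8), rng.uniform(-1, 1, 8), rng.uniform(4, 8, 8)
])
X2 = X1 @ R_true.T + T_true

points_1 = X1 / X1[:, 2:3]  # normalised image coordinates in camera 1
points_2 = X2 / X2[:, 2:3]  # normalised image coordinates in camera 2

E = estimate_essential(points_1, points_2)
```

With noise-free matches, `E` is proportional to \(\hat{T}R\) computed from the true motion; the sign and the scale are the two things the linear system cannot tell us.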

Once we solved for a generic matrix \(F\), we can find the closest \(E\) by doing

\[ \begin{matrix} F = U \text{diag}(\lambda_1, \lambda_2, \lambda_3) V^{T} \phantom{..........}\\ E = U \text{diag}(\sigma, \sigma, 0)V^{T} \quad \sigma = \frac{\lambda_1+\lambda_2}{2} \end{matrix} \]

As we said before, there is a scaling factor that we cannot reconstruct. To fix the scale we can impose \(\sigma = 1\), obtaining a final essential matrix \(E = U \text{diag}(1, 1, 0) V^T\).

Caveats

\(E = 0\) is a solution in which we collapse everything to a point, it's a valid solution but we don't care about it
There are degenerate cases (e.g., all the matches lie on a line or plane) where, no matter how many points you have, you cannot have a unique solution
We cannot get the sign of E (also \(-kE^{s}\) is a solution), so we have 4 possible combinations for R and T. The solution to the problem is to pick the \(R\) and \(T\) couple which gives positive depth values (the 3D points are in front of the camera).
If \(T = 0\), that is there is no translation the algorithm fails, but this never happens in real life.

How do we extract the possible combinations of \(R\) and \(T\)?

Given

\[ W = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \\ \end{pmatrix} \]

which describes a rotation of \(\pi/2\) around \(z\), we have four possible solutions given by two rotation matrices \(R_1\) and \(R_2\) and two translations \(T_1\) and \(T_2\).

\[ R_1 = UWV^{T} \qquad R_2 = UW^{T}V^{T} \]

\[ T_1 = U_3 \qquad T_2 = -U_3 \]

Let's have a look at a toy example using Python, full code here.

Let's generate a fixture world with two cameras and eight 3D points. In the image, each frame is represented with r (x-axis), g (y-axis) and b (z-axis).

Now, we assume that our cameras have a focal length of one and we transform the points into the normalized image space. The resulting images for both the cameras are represented in the figure, where colours match point correspondences.

From the points, we can compute the Kronecker product and extract our estimated essential matrix.

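import numpy as np


def _extract_rot_transl(U, Vt):
    # np.linalg.svd returns V already transposed, so Vt plays the role of V^T.
    W = np.array(([0, -1, 0], [1, 0, 0], [0, 0, 1]))
    # T is the third column of U (the left singular vector of the zero singular value).
    t = U[:, -1]
    return [
        [np.dot(U, np.dot(W, Vt)), t],
        [np.dot(U, np.dot(W, Vt)), -t],
        [np.dot(U, np.dot(W.T, Vt)), t],
        [np.dot(U, np.dot(W.T, Vt)), -t],
    ]


# _compute_kronecker is defined in the full code: it builds chi, one row per match.
chi = _compute_kronecker(points_1, points_2)
_, _, V1 = np.linalg.svd(chi)
# The last right singular vector spans the null space of chi; unstack it into a 3x3 matrix.
F = V1[8, :].reshape(3, 3).T
U, _, Vt = np.linalg.svd(F)
possible_r_t = _extract_rot_transl(U, Vt)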

Let's compare the results we got with the original rotation and translation from camera_1 to camera_2.

One of the solutions we get:

R = [[ 0.95533649, -0.        ,  0.29552021],
     [ 0.0587108 ,  0.98006658, -0.18979606],
     [-0.28962948,  0.19866933,  0.93629336]]
t = [-1.,  0.,  0.]

With the original \(R\) and \(T\):

R = [[ 0.95533649, -0.        ,  0.29552021],
     [ 0.0587108,   0.98006658, -0.18979606],
     [-0.28962948,  0.19866933,  0.93629336]]
t = [-1.5,  0.,  0.]

As you can see, we were able to recover \(R\) exactly and \(T\) up to a scaling factor.

Conclusions: We had an in-depth look at the eight-points algorithm to reconstruct the rigid transformation between two camera poses observing the same 3D points. We formally introduced the algorithm, discussed its caveats and had a look at a toy example using synthetic data in Python.

All you need to know about Learning Tests

Thu, 13 Feb 2020 07:13:50 +0000

Picture this scenario: you have to solve a new task, a new amazing coding problem. After some googling, you find a library that solves part of the problem for you. Great! You write your code using the library, you write a test…and

FAIL!

You tweak the code a bit…FAIL!

I think we all went through this multiple times during our developer career.

What is happening is that you found a new library to implement a certain behaviour that you want. You think you understood the library (but you have this nagging feeling that maybe you did not), and you think you know how to use it for your particular use case (but you have this nagging feeling that maybe you do not).

What’s a good approach? There is one simple answer: Learning tests

What is a learning test?

A learning test is a test you write to test your understanding of a third-party library or API. You basically write some tests in which you use the library as you would in your production code and you check that the behaviour is what you expect.

The point here is that you are NOT testing the library (it should have its own tests), you are testing your understanding of it.
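As a small Python illustration of the idea, here are learning tests for the standard-library `bisect` module, standing in for the "new library" (an assumption for the sake of the example). Each test writes down one thing we believe about `bisect_left` after reading its documentation.

```python
import bisect
import unittest


class TestBisectUnderstanding(unittest.TestCase):
    """Learning tests: they check our reading of the bisect docs, not bisect itself."""

    def test_returns_index_of_existing_element(self):
        self.assertEqual(bisect.bisect_left([10, 20, 30], 20), 1)

    def test_returns_insertion_point_when_missing(self):
        # We expect the index where the value would have to be inserted.
        self.assertEqual(bisect.bisect_left([10, 20, 30], 25), 2)

    def test_leftmost_index_for_duplicates(self):
        self.assertEqual(bisect.bisect_left([10, 20, 20, 30], 20), 1)

    def test_unsorted_input_is_not_detected(self):
        # No exception is raised on unsorted input; the result is just not meaningful.
        data = [30, 10, 20]
        index = bisect.bisect_left(data, 10)
        self.assertNotEqual(index, data.index(10))
```

The last test is the most valuable one: it records that sorting the input is our responsibility, which is easy to forget six months later.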

Why should you write learning tests?

An alternative would be to perform your own experiments using the library and then, when you are sure about its behavior, just use it in the production code.

While this may suffice, there are indeed several advantages in writing your “experiments” as actual tests.

You would write the experiments anyway, so you are not adding any coding overhead.
Learning tests protect your code against changes in the library itself. If a new version is released where a behaviour (or interface) is changed, you will immediately see your tests fail. This will save you hours of painful debugging, spent only to find out that you’re using a version of the library that is no longer compatible with your code.

Let’s make an example of a learning test

Disclaimer: the example is trivial and probably everything can be solved beforehand reading the documentation accurately.

Let’s say we have a data structure MyStructWithTime abstracting some data with a timestamp and we want to write a function to search by timestamp in a slice of our data structure.

After some research we encounter the sort package in Go and we decide to give its Search function a try. The package provides functionalities to sort slices and user-defined collections.

After a little bit of digging in the documentation, we think we got the mechanism. We write our search function

// MyStructWithTime a structure with time.
type MyStructWithTime struct {
	foo       int
	timestamp time.Time
}

func findInStruct(in []MyStructWithTime, query time.Time) int {
	i := sort.Search(len(in), func(i int) bool {
		return in[i].timestamp.After(query)
	})
	if i < len(in) && in[i].timestamp.Equal(query) {
		return i
	}

	return -1
}

We then write a test in which we use the library in the same way we would in our production code. First, it is not clear for us if the slice must be already sorted before using sort.Search, so we write a test and see what happens.

earlier := time.Date(2020, time.January, 1, 2, 1, 0, 0, time.UTC)
later := time.Date(2020, time.January, 1, 5, 1, 0, 0, time.UTC)

testcases := []struct {
  name     string
  input    []MyStructWithTime
  query    time.Time
  expected int
}{
  {
    name: "not_sorted",
    input: []MyStructWithTime{
      {timestamp: later},
      {timestamp: earlier},
    },
    query:    earlier,
    expected: 1,
  },
}

You run the test and the result is:

--- FAIL: TestSort (0.00s)
    --- FAIL: TestSort/not_sorted (0.00s)
expected 1, got -1
FAIL
exit status 1
FAIL	0.005s
Error: Tests failed.

Probably we are doing something wrong: probably the slice needs to be already sorted, so we change the test case to

{
    name: "sorted",
    input: []MyStructWithTime{
      {timestamp: earlier},
      {timestamp: later},
    },
    query:    earlier,
    expected: 0,
},

and we re-run the test

--- FAIL: TestSort (0.00s)
    --- FAIL: TestSort/sorted (0.00s)
expected 0, got -1
FAIL
exit status 1
FAIL	0.005s
Error: Tests failed.

again…

There must be something that we are missing here…We dig a bit more into the documentation, especially in the time package documentation, and we discover that After is not inclusive. From the sort documentation we got that we need to test for >= in a case of ascending sorted slice…Perfect!

Let’s fix the function

func findInStruct(in []MyStructWithTime, query time.Time) int {
	i := sort.Search(len(in), func(i int) bool {
		return in[i].timestamp.After(query) || in[i].timestamp.Equal(query)
	})
	if i < len(in) && in[i].timestamp.Equal(query) {
		return i
	}

	return -1
}

we hit the run button…and

Running tool: /usr/local/bin/go test -timeout 30s -run ^(TestSort)$

PASS
ok  	    0.005s
Success: Tests passed.

Success!!

We understood how we should use the library, and in the meantime we learnt a great deal about the sort and time packages.

At this point, the test can be factored into two test cases, which will be added to our test code base:

A test expecting failure for a slice which is not sorted.
A working test where we put everything together.

These two tests will make sure that, if something changes in sort.Search, we will be immediately notified by a test failure.

Conclusions: Anytime you are facing a new library, do not limit yourself to writing some experimental code to understand its use. A better approach is to write learning tests in which you use the library as you would in your production code. In this way you’ll test your actual understanding of the library and you’ll protect your code from disruptive changes from third parties.

Everything you need to know about the Lucas-Kanade tracker

Tue, 11 Feb 2020 07:13:50 +0000

The Lucas-Kanade-Tomasi (LKT) tracker is one of the most used trackers in computer vision. It's easy to implement and understand, it's fast to compute and it works fairly well.

The tracker is based on the Lucas-Kanade (LK) optical flow estimation algorithm. The problem of optical flow estimation is the problem of estimating the motion of the pixels in an image across a sequence of consecutive pictures (e.g., a video).

The idea of the LK estimation is pretty straightforward.

Now, let's imagine you are observing the image through a small hole of the size of a pixel. If you know the gradient of the brightness and you move the image, you can infer something about the direction of the movement. This is true only if the brightness cannot change for any reason other than motion.

We just introduced the first basic assumption of the LK tracker: the brightness of each pixel does not change in time. As each point moves in the image, it will keep its brightness constant.

Let's exemplify with a drawing.

Let's assume at time t you are observing a pixel of an image (left image), and you know that the brightness is increasing towards left and down (the arrows show the gradient of the brightness). At the next time instant, after the camera moved (right image), you notice that the brightness observed through the pixel increased. Given that the brightness does not change for any other reason, you can safely assume that the underlying object observed by the camera has moved up and right (black arrows), or conversely, that the camera moved with a certain velocity v down and left.

You can immediately notice one possible problem: what if the brightness doesn't change for the point we are observing? Or what if the brightness doesn't change in a certain direction? This is called the aperture problem. You can only perceive motion in the directions that are not orthogonal to the direction of the gradient. For example, if you observe a pixel in a monochrome patch you won't be able to perceive any motion, and if you are observing a pixel on a straight edge, you cannot perceive any movement along the edge. Luckily, in natural images this scenario is rare: zooming to different levels will usually give you some texture with a brightness gradient in both the x and y directions. An alternative solution is to observe a window around a pixel, increasing the likelihood of a "full" brightness gradient. Note that if you use a window you are implicitly assuming that all the pixels in the window move in the same way. For this assumption to hold you need very small displacements and a properly sized window; complex motions will easily break it.

Let's now have a look at the math. If you assume that the brightness (I) remains constant for each pixel (x) in time, you can write:

\[I(x(t), t) = \text{const} \Rightarrow \frac{dI}{dt} = 0\]

Applying the chain rule to compute the derivatives we get

\[\nabla I^{T}\frac{dx}{dt}+\frac{\partial I}{\partial t} = 0\]

If you observe the equation, you can see that it exactly describes the intuition we had about brightness changes and motion, and it becomes particularly clear if you rewrite it as:

\[\nabla I^{T}v = -\frac{\partial I}{\partial t}\]

where we called v the velocity of the camera motion. The equation basically says that the delta in brightness given by the velocity of the camera motion accounts for the total change of brightness in time. The velocity vector v is the unknown.

The aperture problem is clearly visible now. On the left, you have the scalar product of the gradient of the brightness and the velocity of motion. Any velocity component orthogonal to the gradient results in a null change in brightness, so it can be added to a solution without violating the equation: a single equation cannot determine v uniquely.

In the LK paper the authors proposed to solve the equation in the least-squares sense, that is finding the v that minimizes the squared residual of the equation. If we consider the case of a window (W) around the pixel x:

\[E(v) = \int_{W(x)} |\nabla I^{T}v + \frac{\partial I}{\partial t}|^{2}dx' \]

This function is quadratic in v, thus its optimum is where the derivative is equal to zero:

\[\frac{dE}{dv} = 0 \Rightarrow v = -M^{-1}q\]

where

\[M = \int_{W(x)}\nabla I\nabla I ^{T}dx'\]

and

\[q = \int_{W(x)}\frac{\partial I(x')}{\partial t}\nabla I(x')dx'\]

The matrix M, which is called the structure tensor, is a 2x2 matrix. If \(rank(M) = 0\), we have a patch with constant brightness, and we are not able to solve for v, since M is not invertible. If \(rank(M) = 2\) we can find v and the solution is unique. If \(rank(M) = 1\) we can only find the component of v in one direction (the direction of the gradient).
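The closed-form solution can be tried on a synthetic window, assuming NumPy is available. The texture, the window size and the displacement are made up for the example, and the derivatives are plain finite differences, so the estimate is only approximate.

```python
import numpy as np


def lucas_kanade_window(frame_0, frame_1):
    """Estimate a single translation v = (vx, vy) for the whole window."""
    # np.gradient returns the derivative along rows (y) first, then columns (x).
    Iy, Ix = np.gradient(frame_0)
    It = frame_1 - frame_0                        # temporal derivative, unit time step
    grad = np.stack([Ix.ravel(), Iy.ravel()])     # shape (2, N)
    M = grad @ grad.T                             # structure tensor
    q = grad @ It.ravel()                         # sum of (dI/dt) * grad I
    if np.linalg.matrix_rank(M) < 2:
        raise ValueError("aperture problem: M is not invertible")
    return -np.linalg.solve(M, q)


def texture(xs, ys):
    """Hypothetical smooth pattern with brightness changes in both x and y."""
    return np.sin(0.3 * xs) + np.cos(0.2 * ys)


ys, xs = np.mgrid[0:32, 0:32].astype(float)
true_v = np.array([0.3, -0.2])                    # sub-pixel motion, as LK assumes
frame_0 = texture(xs, ys)
frame_1 = texture(xs - true_v[0], ys - true_v[1])  # same content, moved by true_v

v = lucas_kanade_window(frame_0, frame_1)
```

Calling the function on a constant patch raises the error, which is the rank 0 case described above.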

In the case presented, we are estimating motion given only by translation. The math can be easily extended to estimate an affine transformation (which covers rotation, scaling and shear as well as translation) in the following way

\[E(p) = \int_{W(x)} |\nabla I^{T}S(x')p + \frac{\partial I}{\partial t}|^{2}dx' \]

where the affine transformation is modeled with a parametric model:

\[S(x)p = \begin{pmatrix} x & y & 1 & 0 & 0 & 0\\ 0 & 0 & 0 & x & y & 1 \end{pmatrix}\begin{pmatrix} p_1 & p_2 & p_3 & p_4 & p_5 & p_6 \end{pmatrix}^{T}\]

Now we can easily solve \(dE/dp = 0\).

Let's see an example using OpenCV and Python (you can find the full code here).

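import cv2

# input_video and DEFAULT_FEATURES_PARAMS are defined in the full code.
video = cv2.VideoCapture(input_video)
success, frame = video.read()
previous_frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
# Pick the corners to track in the first (grayscale) frame.
previous_points = cv2.goodFeaturesToTrack(previous_frame_gray, **DEFAULT_FEATURES_PARAMS)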

First we open the video stream, read the first frame and find pixels to track. OpenCV provides the function goodFeaturesToTrack. Under the hood, the function finds the most prominent corners of the image using the Shi-Tomasi corner detector.

If you had to do it yourself, a simple way to identify good points is to compute the matrix M for all the pixels in the image and choose a set of points for which \(det(M)\) is greater than a certain threshold.

An alternative is to use the Harris corner detector. The idea is to weight the matrix M with a Gaussian centred on the centre of the window W (here \(*\) denotes convolution)

\[M = G_{\sigma} * \nabla I \nabla I ^{T}\]

and then choose pixels such that

\[C(x) = det(M) + k*tr^{2}(M) > \vartheta \]

Intuitively, the eigenvectors of M tell the direction of maximum and minimum variation of the brightness, while the eigenvalues tell the amount of variation. In particular, if the eigenvalues are both low we are in a flat region (there is not much change in brightness), if one of the eigenvalues is much bigger than the other we are on an edge, and if both the eigenvalues are high, we are probably on a corner (brightness changes in both directions).
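The flat/edge/corner reasoning can be written down directly for a 2x2 structure tensor. The threshold and the example matrices are arbitrary values chosen for this sketch.

```python
import math


def eigenvalues_2x2(m):
    """Eigenvalues of a symmetric 2x2 matrix [[a, b], [b, c]], largest first."""
    a, b, c = m[0][0], m[0][1], m[1][1]
    mean = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    return mean + radius, mean - radius


def classify(m, threshold=1.0):
    """Label a structure tensor as flat, edge or corner from its eigenvalues."""
    high, low = eigenvalues_2x2(m)
    if high < threshold:
        return "flat"    # both eigenvalues are low
    if low < threshold:
        return "edge"    # only one eigenvalue is high
    return "corner"      # both eigenvalues are high


flat = classify([[0.01, 0.0], [0.0, 0.02]])
edge = classify([[50.0, 0.0], [0.0, 0.1]])
corner = classify([[40.0, 5.0], [5.0, 30.0]])
```

Requiring the smaller eigenvalue to be above a threshold is the idea behind the Shi-Tomasi detector mentioned earlier.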

The Gaussian improves the results, weighting M based on the distance from the centre.

Now, if you remember from linear algebra

\[C(x) = det(M) + k*tr^{2}(M) = \lambda_1 \lambda_2 + k(\lambda_1+\lambda_2)^{2}\]

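so the criterion that we are using to choose the points will yield a higher value if both the eigenvalues are high. Note that in the usual formulation of the Harris detector the trace term is subtracted (that is, \(k\) is negative in the notation used here), so that edges, where only one eigenvalue is high, get a low score.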

In OpenCV, you can specify to use the Harris detector (you can check in the code how).

Once we found the first interesting points, we can just iteratively extract a new frame and use the LK tracker to compute the optical flow.

success, frame = video.read()
frame_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
new_points, st, _ = cv2.calcOpticalFlowPyrLK(previous_frame_gray,
                                             frame_gray,
                                             previous_points, 
                                             None,
                                             **DEFAULT_LK_PARAMS)

That's what the results look like:

If you check the full code you will notice that the LK tracker has a termination criteria argument. This is interesting because we didn't talk about LK being an iterative algorithm (you iterate on the frames but you do not iterate on each frame). The parameter is used because OpenCV uses a more robust version of LK, which uses "pyramids". One of the main assumptions of the LK algorithm is that we are dealing with very small motions (~1 pixel), and this is rarely the case, especially with high-res cameras. A solution is to use a coarse-to-fine approach (a sort of resolution pyramid). We start by making the image coarser (bigger pixels result in smaller motions, measured in pixels) and estimate the flow at that scale. We then make the image finer, use the coarse estimate as the starting point, and go on iteratively at higher and higher levels of resolution.

Conclusions: We had an in-depth look at the Lucas-Kanade tracker to estimate the optical flow from a sequence of images. We introduced the Harris corner detector and we had a look at a real-life example using OpenCV.

Proto nested messages and repeated fields in Python

Tue, 04 Feb 2020 20:13:50 +0000

Today I was having some problems populating a proto repeated message in Python with a nested message definition, and it took me a while to figure out how to do it.

In reality it is pretty simple. Let’s make an example.

syntax = "proto3";

package test;

message Trajectory2d {
  message Point2d {
    float x = 1;
    float y = 2;
  }
  repeated Point2d points = 1;
}

Let’s save our test.proto and generate the Python code.

protoc --proto_path=. --python_out=. test.proto

Now if we want to create an element of Trajectory2d type and add points to it, we can just use the add() function. The function will create a new message object, append it to the list of repeated objects, and return it for the caller to fill. In addition it will forward keyword arguments to the class.

from test_pb2 import Trajectory2d

trajectory = Trajectory2d()
trajectory.points.add(x=10, y=30)
trajectory.points.add(x=14, y=22)

assert len(trajectory.points) == 2

assert trajectory.points[0].x == 10
assert trajectory.points[0].y == 30

assert trajectory.points[1].x == 14
assert trajectory.points[1].y == 22

Table-driven tests in Python

Sun, 02 Feb 2020 02:13:50 +0000

Table-driven tests are an elegant and functional way to unittest your functions in Go. Let’s see some ideas on how to introduce this same testing pattern in Python.

What are table-driven tests

One thing I really love about Go is table-driven tests. If you are not familiar with them, table-driven tests are a very elegant way to write unittests for your code. The basic idea is that you write a list of named test cases, defining the input and the expected output for each test case, then you loop over the cases, run your function and check that the actual output is equal to the expected one.

An example in Go looks like this, let’s imagine we want to test a sorting function we wrote:

func TestMySort(t *testing.T) {
	testcases := []struct {
		name     string
		input    []float64
		expected []float64
	}{
		{
			name:     "empty_slice",
			input:    []float64{},
			expected: []float64{},
		},
		{
			name:     "already_sorted",
			input:    []float64{1, 4, 6, 8},
			expected: []float64{1, 4, 6, 8},
		},
		{
			name:     "not_sorted",
			input:    []float64{1, 8, 3, 5},
			expected: []float64{1, 3, 5, 8},
		},
	}

	for _, tt := range testcases {
		t.Run(tt.name, func(t *testing.T) {
			actual := mySort(tt.input)
			assertEqualSlices(t, tt.expected, actual)
		})
	}
}

As you can see, we wrote three named test cases (empty slice in input, input already sorted and input not sorted). The final part of the code is just looping and asserting that for each test case we got the expected value.

What I think is really great about table-driven tests is that they allow you to naturally write very modular and concise tests, focusing on test data and expected behaviours. I also find that, from a psychological viewpoint, they help you reason more in depth about test cases and in general be more thoughtful about what input could break your code.

When I switch to Python, I always feel like I’m missing table-driven tests and I always end up finding Pythonic ways of implementing them.

Here are a couple of ideas I came up with.

Python dicts

One simple and yet effective way of implementing table-driven tests in Python is using dicts. Let’s see an example, with the same sorting function.

import unittest

class TestMySort(unittest.TestCase):
    def test_my_sort(self):
        testcases = [
            {"name": "empty_slice", "input": [], "expected": [],},
            {
                "name": "already_sorted",
                "input": [1, 4, 6, 8],
                "expected": [1, 4, 6, 8],
            },
            {"name": "not_sorted", "input": [1, 8, 3, 5], "expected": [1, 3, 5, 8],},
        ]

        for case in testcases:
            actual = my_sort(case["input"])
            self.assertListEqual(
                case["expected"],
                actual,
                "failed test {} expected {}, actual {}".format(
                    case["name"], case["expected"], actual
                ),
            )

The main advantage of this approach is that it’s simple, understandable and it is compatible with every Python version.

The main problem I see is that there is not much protection around the test case data structure. You could make a mistake and the dictionaries could end up with unexpected keys or values of different types. You could add type hints, but the best you can do is annotate the test cases as List[Dict[str, Any]], which is not very strict.
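To make the problem concrete, here is a small sketch of a misspelled key going unnoticed until something reads it. The helper `find_malformed_cases` and the `EXPECTED_KEYS` set are hypothetical names introduced only for this illustration. They show the kind of manual check you would have to write yourself when using plain dicts.

```python
EXPECTED_KEYS = {"name", "input", "expected"}


def find_malformed_cases(testcases):
    """Return the name (or index) of every case whose keys differ from EXPECTED_KEYS."""
    malformed = []
    for index, case in enumerate(testcases):
        if set(case) != EXPECTED_KEYS:
            malformed.append(case.get("name", index))
    return malformed


testcases = [
    {"name": "empty_slice", "input": [], "expected": []},
    # "expcted" is a typo: building the dict works fine, the mistake
    # only surfaces later as a KeyError when case["expected"] is read.
    {"name": "typo_in_key", "input": [1, 8, 3, 5], "expcted": [1, 3, 5, 8]},
]

print(find_malformed_cases(testcases))  # ['typo_in_key']
```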

Data Class

If you are using Python 3.7 or later you can use data classes. A data class is a class that mainly holds data. Its advantage is that it comes with pre-defined methods, such as __init__() and __repr__(), which saves you time when coding.
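As a quick illustration of those generated methods, here is a minimal sketch. The class name `SortCase` is made up for this example. The decorator writes `__init__`, `__repr__` and `__eq__` for us from the annotated fields.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SortCase:
    name: str
    input: List[float]
    expected: List[float]


# __init__ is generated from the field declarations.
case = SortCase(name="not_sorted", input=[1, 8, 3, 5], expected=[1, 3, 5, 8])

# __repr__ is generated too, which gives readable failure messages for free.
print(case)
# SortCase(name='not_sorted', input=[1, 8, 3, 5], expected=[1, 3, 5, 8])

# __eq__ compares instances field by field.
print(case == SortCase("not_sorted", [1, 8, 3, 5], [1, 3, 5, 8]))  # True
```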

Let’s see how we can use them for table-driven tests.

import unittest
from dataclasses import dataclass
from typing import List


class TestMySort(unittest.TestCase):
    def test_my_sort(self):
        @dataclass
        class TestCase:
            name: str
            input: List[float]
            expected: List[float]

        testcases = [
            TestCase(name="empty_slice", input=[], expected=[]),
            TestCase(name="already_sorted", input=[1, 4, 6, 8], expected=[1, 4, 6, 8]),
            TestCase(name="not_sorted", input=[1, 8, 3, 5], expected=[1, 3, 5, 8]),
        ]

        for case in testcases:
            actual = my_sort(case.input)
            self.assertListEqual(
                case.expected,
                actual,
                "failed test {} expected {}, actual {}".format(
                    case.name, case.expected, actual
                ),
            )

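Overall, using data classes gives you a cleaner solution compared to dicts: the fields are declared once, a missing or misspelled field fails as soon as the test case is built, and the type annotations can be verified with a static type checker.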
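The sketch below shows what that protection looks like in practice, again with a made-up `SortCase` class. A misspelled field is rejected by the generated `__init__` right away. Keep in mind that the annotations themselves are not checked at runtime, so catching a wrong type requires an external static type checker.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SortCase:
    name: str
    input: List[float]
    expected: List[float]


rejected = False
try:
    # "expcted" is a typo: unlike a dict, the dataclass refuses to build.
    SortCase(name="typo_in_field", input=[1, 8, 3, 5], expcted=[1, 3, 5, 8])
except TypeError as error:
    rejected = True
    print("rejected:", error)

# Annotations are not validated at runtime: this instance is created
# without any error even though `input` is a str, not a list.
wrong_type = SortCase(name="wrong_type", input="not a list", expected=[])
print(type(wrong_type.input).__name__)  # str
```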

In this article we had a quick look at what table-driven tests are in Go and why they are a nice feature. We then explored possible solutions for implementing table-driven tests in Python.
