Mobile Robotics
State Estimation
We use probabilistic state estimation because the world itself is uncertain.
Perhaps the most important formula for this is Bayes Theorem:

$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$$
In a lot of cases, obtaining $P(y | x)$ is much easier than $P(x | y)$.
In practice, the denominator (the evidence) can be obtained through some summation and normalisation. Therefore, we can think of the formula as $P(x|y) = \eta P(y|x)P(x)$, where $\eta$ represents this normalised summation.
Bayesian State Estimation
Suppose a robot makes an observation $z$ when trying to detect if a door is open. We begin with the robot not knowing any prior information about the state of the door - the door has equal probability of being open or closed.
What we’re after is the probability the door is open given our sensor measurement, $P(open | z)$. This cannot be directly measured, however.
Now suppose that we know if the door is open, then we would have had a probability of 0.6 of observing our measurement $z$. That is, $P(z | open) = 0.6$. Similarly if the door was not open, suppose we know the probability we would have made this measurement is 0.3, so $P(z | \lnot open) = 0.3$. Note that these probabilities do not necessarily need to sum to 1.
We can now apply Bayes theorem to obtain $P(open | z)$. The denominator represents the probability of obtaining the measurement $z$ ($P(z)$), so we enumerate each possibility leading to this result (in this case the door either being open or not open), and sum these probabilities, as the two cases are mutually exclusive.
So:

$$P(open \mid z) = \frac{P(z \mid open)\,P(open)}{P(z \mid open)\,P(open) + P(z \mid \lnot open)\,P(\lnot open)}$$
Now substituting in our values, remembering our prior belief is that $P(open) = 0.5$:

$$P(open \mid z) = \frac{0.6 \times 0.5}{0.6 \times 0.5 + 0.3 \times 0.5} = \frac{0.3}{0.45} = \frac{2}{3} \approx 0.67$$
Notice this posterior of roughly 0.67 is different from the 0.5 we previously assumed was the chance the door was open - by observing this measurement we have changed our belief about the state of the door.
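The single update above can be sketched in a few lines of Python (the values are taken directly from the worked example):

```python
# Single-step Bayes update for the door example; values come from the
# worked example above.

def bayes_update(prior_open, p_z_given_open, p_z_given_closed):
    """Return P(open | z) via Bayes theorem with a normalising evidence term."""
    evidence = p_z_given_open * prior_open + p_z_given_closed * (1 - prior_open)
    return p_z_given_open * prior_open / evidence

posterior = bayes_update(0.5, 0.6, 0.3)
print(round(posterior, 3))  # 0.667
```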
Subsequent Measurements
Bayesian State Estimation relies on an important assumption - the Markov Assumption. This assumes that the current sensor measurement is independent of all previous measurements, given the current state $x$. In the real world this isn’t always true, but as long as it is approximately correct it is acceptable in most cases.
In the Recursive Bayes method, we can then take the updated belief following our first observation, $P(open | z_1)$, and use it as the prior for subsequent measurements. Exactly the same method is used as above, and we end up obtaining $P(open | z_1, z_2)$. This can continue for as many measurements as required.
In practice, this should be implemented using dynamic programming. Following each iteration, the only information we are required to store is the updated belief, and this can then be directly used in the next iteration without having to recursively calculate all the priors from scratch.
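The recursive update can be sketched as a simple fold over the measurement stream, storing only the current belief between iterations (the repeated measurement values are just the worked example reused):

```python
def recursive_bayes(prior_open, measurements):
    """Fold a stream of measurements into the belief. Each measurement is
    a pair (P(z | open), P(z | not open)); only the current belief is kept."""
    belief = prior_open
    for p_z_open, p_z_closed in measurements:
        evidence = p_z_open * belief + p_z_closed * (1 - belief)
        belief = p_z_open * belief / evidence
    return belief

# Observing the same evidence twice pushes the belief further towards open.
print(round(recursive_bayes(0.5, [(0.6, 0.3), (0.6, 0.3)]), 3))  # 0.8
```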
Bayes Filter
In a Bayes filter, we take the following inputs:
- Stream of sensor measurements $z$ and actions $u$,
- Observation (sensor) model $P(z_t | x_t)$,
- These are the probabilities of observing certain measurements given the real state conditions.
- Action (transition) model $P(x_t | u_t, x_{t-1})$,
- This models what we expect our real state to be following some previous state having performed some action.
- Prior (initialisation) $P(x)$ - whatever is known about the state beforehand.
We then output the posterior of $x$, that is the probability we are in any given state $x_t$ with respect to all the previous information.
Derivation
We’re after finding the belief for a given state given all previous observations and actions:

$$Bel(x_t) = P(x_t \mid u_1, z_1, \ldots, u_t, z_t)$$
The first thing we do is apply Bayes rule - we want this in terms of the probabilities of observations given actual state. Simplifying this with a normalising factor $\eta$ we obtain:

$$Bel(x_t) = \eta \, P(z_t \mid x_t, u_1, z_1, \ldots, u_t) \, P(x_t \mid u_1, z_1, \ldots, u_t)$$
That is, we’re now looking at the probability of obtaining a given sensor value given our current belief about the state (which also depends on all previous observations). But, the Markov assumption applies here, so in fact the sensor observation does not need to depend on all previous observations - only the current state - so this can be simplified to:

$$Bel(x_t) = \eta \, P(z_t \mid x_t) \, P(x_t \mid u_1, z_1, \ldots, u_t)$$
Now, the right hand term (the prior) represents the probability of transitioning to state $x_t$, given that we were in $x_{t-1}$ and performed the previous sequence of actions and made the previous sensor observations. Similar to the door example above, we find this by considering the probability we would transition to $x_t$, given every possible previous configuration $x_{t-1}$. In the door example, that would be asking ‘what is the probability the door is now open, based on the probabilities of obtaining our previous sensor measurements given every possible state the door could have been in’. In that example, we could represent that with a summation, but in general this is an integral:

$$Bel(x_t) = \eta \, P(z_t \mid x_t) \int P(x_t \mid x_{t-1}, u_1, z_1, \ldots, u_t) \, P(x_{t-1} \mid u_1, z_1, \ldots, u_t) \, dx_{t-1}$$
We can make a further simplification due to the Markov assumption - we assume our state only depends on the previous state and whatever action is performed, so the first term after the integral can be reduced:

$$Bel(x_t) = \eta \, P(z_t \mid x_t) \int P(x_t \mid u_t, x_{t-1}) \, P(x_{t-1} \mid u_1, z_1, \ldots, z_{t-1}) \, dx_{t-1}$$
And now finally notice the second term in the integral - it is precisely our definition for $Bel(x_{t-1})$! Therefore we can substitute this in and finally obtain the complete formulation:

$$Bel(x_t) = \eta \, P(z_t \mid x_t) \int P(x_t \mid u_t, x_{t-1}) \, Bel(x_{t-1}) \, dx_{t-1}$$
We can see how this relates to the idea of taking some prior belief, making a prediction, and verifying this with measurements:
Discrete Filter
The main problem with the Bayesian state estimation formula is the integral, so we make assumptions to simplify things. In a discrete filter, we assume everything is discrete. Importantly, this means we assume the state space is finite.
The integral then becomes a sum over all possible states - the transition model can be represented by an $N \times N$ matrix, where $N$ is the total number of states. We can then model the entire process using matrix multiplication.
In our matrices for sensor observation probabilities, each row will represent a given state the robot could be in, with the columns representing the probability the sensor will report a given state, if the robot was in the actual state given by the row. Therefore, rows should sum to 1, though columns do not need to.
When a sensor measurement is made, we construct a matrix to use in our calculation representing $P(z_t | x_t)$ by taking the column of the sensor observation matrix corresponding to this measurement and constructing a diagonal matrix with these values.
Similarly, our prediction step is represented by a matrix of possible state transitions, with each row representing a possible state and values in corresponding columns representing the probability of transitioning to those states.
The problem with discrete filters is that even for relatively small numbers of states, the computational complexity grows very quickly, and for a lot of problems modelling in a discrete way is not realistic.
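As a concrete sketch of the matrix formulation (the two-state door world and all the numbers here are hypothetical, chosen only for illustration), the prediction and update steps reduce to matrix-vector products:

```python
import numpy as np

# Transition matrix: rows are current states, columns are next states.
# Rows sum to 1. State 0 = open, state 1 = closed.
T = np.array([[1.0, 0.0],   # an open door stays open
              [0.2, 0.8]])  # a closed door gets pushed open 20% of the time

# Sensor matrix: rows are true states, columns are reported states.
S = np.array([[0.6, 0.4],
              [0.3, 0.7]])

belief = np.array([0.5, 0.5])   # uniform prior

# Prediction step: sum over previous states (matrix-vector product).
belief = T.T @ belief

# Update step: diagonal matrix built from the sensor column for the
# observed measurement, followed by normalisation (the eta factor).
z = 0  # sensor reports 'open'
belief = np.diag(S[:, z]) @ belief
belief /= belief.sum()
print(belief)  # door is now believed open with probability 0.75
```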
Examples coming soon…
Linear Gaussian Filter
Linear Gaussian filters are the first step into continuous state estimation. In these, we assume all the components required by the Bayesian formula can be modelled with gaussians.
Gaussians have the useful property that applying linear functions or multiplying together gaussians always results in another gaussian, which is the property which allows us to do this.
Kalman Filter
The Kalman Filter assumes we have a:
- Continuous state space (but with discrete time),
- State vector $x$ is a continuous random variable, perhaps with multiple dimensions.
- Everything is linear.
- States evolve linearly.
- Sensor measurements can be expressed with a linear function of the states.
- That is, the output state must depend on some linear function of input measurements and existing state.
This means we can model the state transition using:

$$x_t = A_t x_{t-1} + B_t u_t + \epsilon_t$$
And we can model the sensor measurements using:

$$z_t = C_t x_t + \delta_t$$
Where…
- $A$ is an $n \times n$ matrix which describes how the state would evolve from $t-1$ to $t$ without any control input or noise.
- $B$ is an $n \times l$ matrix which describes how some control input $u_t$ changes the state from $t-1$ to $t$.
- $C$ is a $k \times n$ matrix which describes how any given state $x_t$ maps to sensor observations $z_t$.
- $\epsilon$ and $\delta$ are random variables representing noise, and are assumed to be independent and normally distributed.
Initialisation
We assume the initial state can be modelled as a gaussian. This is required because otherwise our subsequent steps would not still result in gaussians.
Prediction
In the prediction step, we calculate the predicted mean and error covariance using:
- $\mu_t = A_t \mu_{t-1} + B_t u_t$
- $\Sigma_t = A_t \Sigma_{t-1} A_t^T + R_t$
Update
In the update step, we first calculate the Kalman gain $K_t$, using:

$$K_t = \Sigma_t C_t^T \left(C_t \Sigma_t C_t^T + Q_t\right)^{-1}$$

where $R_t$ (above) is the process noise covariance and $Q_t$ is the measurement noise covariance.
We then update the state estimate and error covariance:
- $\mu_t = \mu_t + K_t (z_t - C_t \mu_t)$
- $\Sigma_t = (I - K_t C_t)\Sigma_t$
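The predict/update cycle above can be sketched directly in code. The matrices and noise values in the toy example below are made up for illustration (a 1D robot commanded to move one unit, then observed at 1.2):

```python
import numpy as np

def kalman_step(mu, Sigma, u, z, A, B, C, R, Q):
    """One predict/update cycle following the equations above.
    R is the process noise covariance, Q the measurement noise covariance."""
    # Prediction
    mu_bar = A @ mu + B @ u
    Sigma_bar = A @ Sigma @ A.T + R
    # Update
    K = Sigma_bar @ C.T @ np.linalg.inv(C @ Sigma_bar @ C.T + Q)
    mu_new = mu_bar + K @ (z - C @ mu_bar)
    Sigma_new = (np.eye(len(mu)) - K @ C) @ Sigma_bar
    return mu_new, Sigma_new

# Toy 1D example: state is position on a line.
A = np.array([[1.0]]); B = np.array([[1.0]]); C = np.array([[1.0]])
R = np.array([[0.1]]); Q = np.array([[0.5]])
mu, Sigma = kalman_step(np.array([0.0]), np.array([[1.0]]),
                        np.array([1.0]), np.array([1.2]), A, B, C, R, Q)
print(mu, Sigma)
```

Note how the posterior mean lands between the prediction (1.0) and the measurement (1.2), weighted by the Kalman gain, and the covariance shrinks after the update.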
More filters TODO…
Mapping
Occupancy Grid Maps
- Represent the environment as a grid.
- This grid is rigid.
- Estimate the probability that each cell in the grid is occupied or free.
- Large maps may require substantial memory resources.
- Don’t rely on a feature detector.
Assumptions
- The area which corresponds to a cell is either free or occupied.
- The world is static.
- Cells are independent of each other.
- The probability distribution of the map is given by the product over the cells.
- Robot positions are known.
In this mapping scheme, we update each individual cell using a binary Bayes filter.
We use log odds to represent occupancy, calculated using:

$$l(x) = \log \frac{p(x)}{1 - p(x)}$$
Algorithm Outline
- For all cells $m_i$…
- If $m_i$ is in the perceptual field of $z_t$ (the sensor measurement) then…
- $l_{t,i} = l_{t-1,i} + \text{inverse\_sensor\_model}(m_i, x_t, z_t) - l_0$.
- Otherwise $l_{t,i}$ doesn’t change, i.e. $l_{t,i} = l_{t-1,i}$.
- Return occupancies $\{l_{t,i}\}$.
The inverse sensor model decides whether a cell is occupied or not given the sensor observations. It is called ‘inverse’ because it reasons from effects to causes, which is the opposite of the usual Bayes filters which reason from causes to effects. This can be learned using a supervised learning algorithm from the forward model.
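A minimal sketch of the log odds update for a single cell, assuming a hypothetical inverse sensor model that simply returns a fixed occupancy probability of 0.7 for cells the sensor observes:

```python
import math

def log_odds(p):
    """Convert a probability to log odds."""
    return math.log(p / (1 - p))

def prob(l):
    """Convert log odds back to a probability."""
    return 1 - 1 / (1 + math.exp(l))

l0 = log_odds(0.5)   # prior log odds: completely unknown, so zero
l = l0
for _ in range(3):   # three consistent 'occupied' readings of this cell
    l = l + log_odds(0.7) - l0
print(round(prob(l), 3))  # 0.927
```

Because the update is additive in log odds space, repeated consistent evidence drives the occupancy probability towards 1 without any renormalisation step.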
SLAM
The SLAM problem involves trying to estimate the robot pose and map of the environment simultaneously.
Full SLAM attempts to estimate the entire path the robot has taken, that is $p(x_{0:t}, m | z_{1:t}, u_{1:t})$ - where $x$ is the robot position, $m$ is the map, $z$ are the sensor measurements, and $u$ are the control inputs.
Online SLAM only attempts to estimate the current pose, that is $p(x_t, m | z_{1:t}, u_{1:t})$. In practice, online SLAM is done recursively, similar to how recursive Bayes is implemented.
There are many considerations to make when developing a method for SLAM:
- Dense mapping vs Sparse mapping.
- Topological mapping vs Metric mapping.
- Known correspondence vs Unknown correspondence.
- Still environment vs Dynamic environment.
- Small uncertainty vs Large uncertainty.
- Passive mapping vs Active mapping.
- Single robot vs Multi-robot.
There are three main paradigms for SLAM:
- Kalman Filter based.
- EKF-SLAM
- Particle Filter based.
- Fast-SLAM
- Graph based.
- LS-SLAM
EKF-SLAM
Don’t need to understand the mathematics behind this for the exam.
This method operates by using an EKF to estimate both the robot pose and the locations of the features in the environment. The state space can be:

$$x_t = (x, y, \theta, m_{1,x}, m_{1,y}, \ldots, m_{n,x}, m_{n,y})^T$$
With $n$ landmarks, the gaussian required is $(3+2n)$ dimensional.
As with an EKF, the operation of the algorithm follows the process of:
- State prediction.
- Measurement prediction.
- Obtain measurement.
- Data association.
- Update prediction.
Motion Planning
The problem of motion planning is defined as given:
- Robot start pose,
- Desired goal pose,
- Geometric description of robot,
- Geometric description of environment,
we want to find a path that:
- Moves the robot from the start to goal pose,
- Avoids any obstacles.
The motion planning problem is typically solved in another space, the configuration space. The configuration space is the set of all possible robot configurations. A robot configuration $q$ describes the positions of all robot points with respect to a reference frame (the robot is not a point, so can be expressed as a vector of several positions and orientations).
Assuming a circular robot, we can obtain the C-space obstacles by ‘sliding’ the robot along the edges of all obstacles and recording the points the robot’s centre can reach. With a polygonal robot the same is done, but the sliding must be repeated for every orientation of the robot.
We can define the free space given a configuration space as the subset of the space such that no obstacles are present. Similarly, the obstacle space is the complement of free space.
Planning can then be done as if the robot was a point in free space.
As C-space is continuous, we want to discretise it to generate a ‘road map’ - a graph of free space where each vertex is a configuration in free space and each edge represents a collision-free path.
Combinatorial Methods
Visibility Graphs
Idea of this approach is to construct the road map by visibility through free space.
E.g. from each vertex of an obstacle in C-space, which other vertices of other obstacles can be seen?
Voronoi Diagram
Another approach is to construct a Voronoi diagram and use that as the map.
Can be constructed quickly, though the paths generated are often not optimal in open space.
Cell Decomposition
Involves splitting the free-space into non-overlapping cells (e.g. trapezoids).
This effectively approximates the edges of obstacles.
Sampling-Based Methods
Probabilistic Road Maps
Vertices are randomly sampled from C-space - if a point is in free space then we can add it as a vertex. Vertices close to each other are then connected, either using some local planning algorithm within a given radius, or a kNN algorithm.
The advantages of this approach are that it is not necessary to construct C-space and remains efficient with higher dimension C-spaces, however it can fail to compute paths through narrow passages and is not optimal.
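Roadmap construction can be sketched as below, assuming a unit-square C-space and a user-supplied collision checker; the `is_free` predicate, the obstacle, and the radius-based connection rule are illustrative choices, not fixed parts of the method:

```python
import math
import random

def sample_prm(num_samples, is_free, connect_radius, seed=0):
    """Sketch of PRM roadmap construction in a unit-square C-space.
    is_free(q) is an assumed user-supplied collision checker."""
    rng = random.Random(seed)
    vertices = []
    while len(vertices) < num_samples:
        q = (rng.random(), rng.random())
        if is_free(q):                 # rejection-sample free space
            vertices.append(q)
    edges = []
    for i, a in enumerate(vertices):
        for j in range(i):
            if math.dist(a, vertices[j]) <= connect_radius:
                # A full planner would also collision-check the segment.
                edges.append((j, i))
    return vertices, edges

# Free space: everything outside a central square obstacle.
free = lambda q: not (0.4 < q[0] < 0.6 and 0.4 < q[1] < 0.6)
vertices, edges = sample_prm(30, free, connect_radius=0.3)
print(len(vertices), len(edges))
```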
Random Trees
The algorithm (e.g. the Rapidly-exploring Random Tree, RRT) operates by aggressively probing and exploring C-space, incrementally expanding from some initial configuration $q_0$. The explored territory is represented by a tree rooted at $q_0$.
In some cases, this can be done bi-directionally, growing two trees with one starting at the initial pose and one starting from the goal pose.
Searching
Searching is often done using A*, though of course any standard search algorithm can be used.
Importantly however, these assume the robot will be able to execute the planned path perfectly, which in practice is often not the case.
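A compact sketch of A* over an occupancy grid, using Manhattan distance as the admissible heuristic (the grid and unit step costs are illustrative):

```python
import heapq

def astar(grid, start, goal):
    """A* over a 4-connected grid; grid[r][c] == 1 marks an obstacle."""
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), 0, start, [start])]   # (f, g, position, path)
    seen = set()
    while frontier:
        f, g, pos, path = heapq.heappop(frontier)
        if pos == goal:
            return path
        if pos in seen:
            continue
        seen.add(pos)
        r, c = pos
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) \
                    and grid[nr][nc] == 0 and (nr, nc) not in seen:
                heapq.heappush(frontier, (g + 1 + h((nr, nc)), g + 1,
                                          (nr, nc), path + [(nr, nc)]))
    return None  # no path exists

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))  # routes around the wall via column 2
```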
Sensors
Robots need sensors to measure parameters of the robot itself, and its environment.
The Sensing Process
The sensing process consists of five main steps:
- Transducer converts a physical quantity into an electrical signal.
- Signal Conditioning - We want our signals to be linear, and they may need to be amplified.
- Computer Interface.
- Data Reduction and Analysis - Data may be compressed if required and fused to generate a model of the world.
- Perception - Models are analysed to infer about the state of the robot surroundings.
Characterising Sensors
Sensors are noisy and often return an incomplete picture. They may also vary in performance and accuracy given the environment.
Some characteristics to consider are:
- Response time,
- Cost,
- Error rate,
- Robustness,
- Power requirements,
- Sensitivity,
- Linearity,
- Range,
- Accuracy,
- Resolution.
Resolution
Resolution is the minimum difference between two values that the sensor can distinguish. For digital sensors, this is usually the resolution of the ADC.
Bandwidth or Frequency
Used to measure the speed with which a sensor can report readings - measured in Hertz.
Mobile robots are often limited in maximum speed by the bandwidth of their obstacle detection sensors.
Sensitivity
Measure of how much the target signal changes the output signal - ratio of output change to input change.
Cross-sensitivity is a measure of sensitivities orthogonal to the target parameter (e.g. using a compass to measure in an iron building).
Accuracy and Error
Defined as the degree of conformity between the sensor measurements and the true value.
The difference between the measured value and true value is known as the error.
There are two types of error:
- Systematic - Predictable and caused by factors which can be modelled (e.g. calibration etc.).
- Random - Not predictable but can be treated in a probabilistic way.
Precision
Related to the reproducibility of the results:

$$\text{precision} = \frac{\text{range}}{\sigma}$$
where $\sigma$ is the standard deviation of the random error.
Sensor Classification
Sensors may be classified based on particular characteristics:
- What they measure.
- Principle behind the sensor.
- Technology employed.
- Energy used.
- Spatial relationship between sensor and object.
One example is Proprioceptive (Internal) vs Exteroceptive (External):
- Proprioceptive (Internal)
- Measure values internal to the robot.
- Required for kinematic modelling and control.
- Exteroceptive (External)
- Information from the robot’s environment.
- E.g. distances to objects, intensity of ambient light etc.
Another is Passive vs Active:
- Passive
- Sensor just listens.
- Energy comes from the environment.
- Active
- Emit energy and measure response.
- Often better performance but may have some influence on the environment.
Common Sensors
- Inclinometers
- Used to measure how much a robot is tilted.
- Two possible implementations.
- Mercury Switch
- Small drop of mercury between two contacts.
- The orientation of the sensor determines whether the circuit is closed or open.
- Output is binary.
- Electrolyte Sensor
- Electrodes immersed in conductive fluid.
- Resistance is a function of the rotation of the sensor.
- Analogue, but more expensive than the mercury sensor.
- Accelerometers
- Measure linear acceleration.
- Usually cheap.
- Gyroscopes
- Measure angular velocity (rate of rotation).
- Wheel Encoders
- Measure rotational motion and position.
- Use a light emitter and detector together with a wheel to produce a square wave.
- Counter detects rising and falling edges to determine motion.
- Quadrature encoders use pairs of slits with slightly different phase, which can be used to improve resolution and allow measuring direction of motion.
- Touch Sensors
- Fall into two main categories:
- Contact Sensors
- Push switches.
- Lever switches.
- Tactile Sensors
- Use piezo-electric material.
- Often arranged in 2D arrays to measure force patterns.
- Proximity Sensors
- Passive Infrared Sensors
- Very simple and cheap method of detecting the motion of objects which emit IR.
- Only provide binary detection.
- Very noisy signals.
- Hall Effect Sensors
- Measure change in magnetic fields.
- Simple and cheap.
- Can be affected by magnetic fields other than those targeted.
- Light and Temperature Sensors
- Operate using Ohm’s law - they are variable resistors whose resistance depends on the measured property.
- Range Sensors
- Detect the distance to an object.
- Three main approaches…
- Time of Flight Ranging
- Use speed equation.
- Send out pulse of sound or light and time how long it takes to come back.
- Phase Shift Measurement
- Reference beam travels distance $L$ to detector from beam splitter.
- Reflected beam travels $D’ = L + 2D$, where $D$ is distance from beam splitter to target.
- If $D = 0$, then both beams would arrive simultaneously.
- If $D$ increases, there will be a phase difference between the beam emitted and directly measured, and the reflected beam. In that case:
- $D’ = L + \frac{\theta}{2 \pi}\lambda \implies D = \frac{\lambda}{4\pi}\theta$.
- Measurements can be ambiguous - unique solution can only be obtained if $2D < \lambda$.
- Triangulation
- Measure angle between emitted and reflected beam to determine distance to object.
- Common implementations of these are…
- Ultrasonic Sensors
- Operate using time of flight.
- Only effective at short range.
- Use high frequency sound.
- Conceptually simple.
- Very noisy, have to consider cross-talk and specular reflections.
- Radar
- Uses radio waves.
- Relatively cheap, accurate, and works over a long range.
- Though needs proper beam forming and signal processing, and prone to interference.
- Laser Range Finders
- Limited range - around 50-150m.
- Often expensive and use a lot of power.
- Cannot be used to detect transparent objects.
- LIDAR
- Uses lasers and spinning mirrors to obtain a map of the environment.
- Very expensive.
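The phase shift distance formula given above can be checked with a short sketch (the wavelength and phase values are made-up test inputs):

```python
import math

def phase_shift_distance(theta, wavelength):
    """Distance from phase difference theta (radians), via D = (lambda/4pi)*theta.
    The result is unambiguous only while 2*D < wavelength."""
    return wavelength * theta / (4 * math.pi)

# A quarter-cycle phase shift (pi/2) with a 10 m wavelength:
print(phase_shift_distance(math.pi / 2, 10.0))  # 1.25
```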
GPS
For measuring position in the world, we can use GPS. The civilian version is accurate to about 10m.
At least 4 satellites can be seen from almost anywhere, and each transmits its position and the current time. Receivers can pick up these signals and determine their position using trilateration (using signal travel times to measure the distances to the satellites).
Derivation
Consider the case of three satellites, with distances measured to each as $r_1, r_2$ and $r_3$. We choose coordinates so that the first satellite is at the origin $(0, 0, 0)$, the second is at $(d, 0, 0)$ for some $d$, and the third is at $(i, j, 0)$, where the $j$ axis is chosen orthogonal to the line connecting satellites 1 and 2 (so all three satellites lie in the $z = 0$ plane).
Therefore, we have:

$$r_1^2 = x^2 + y^2 + z^2$$
$$r_2^2 = (x - d)^2 + y^2 + z^2$$
$$r_3^2 = (x - i)^2 + (y - j)^2 + z^2$$
Now $r_1^2 - r_2^2 = x^2 - (x-d)^2$. This can be expanded to obtain $r_1^2 - r_2^2 = 2xd - d^2$, and so we are able to obtain:

$$x = \frac{r_1^2 - r_2^2 + d^2}{2d}$$
Now we solve for $y$ using $r_1^2 - r_3^2 = 2ix - i^2 + 2jy - j^2$:

$$y = \frac{r_1^2 - r_3^2 + i^2 + j^2 - 2ix}{2j}$$
Finally, $z$ can simply be obtained from the equation for $r_1^2$ by:

$$z = \pm\sqrt{r_1^2 - x^2 - y^2}$$
In practice however, we usually use 4 satellites. This is because accurate positioning relies on accurate timing, and receiver clocks are much less accurate than the transmitting clocks. Therefore, you actually need to solve for $x$, $y$, $z$, and $t$ simultaneously.
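The closed-form three-satellite solution derived above can be sketched as follows, ignoring receiver clock error (the satellite positions and receiver location are made-up test values):

```python
import math

def trilaterate(r1, r2, r3, d, i, j):
    """Closed-form solution from the derivation above; satellites at
    (0,0,0), (d,0,0) and (i,j,0). Returns the solution with z >= 0."""
    x = (r1**2 - r2**2 + d**2) / (2 * d)
    y = (r1**2 - r3**2 + i**2 + j**2 - 2 * i * x) / (2 * j)
    z = math.sqrt(max(r1**2 - x**2 - y**2, 0.0))
    return x, y, z

# Receiver actually at (1, 2, 2): compute exact ranges, then recover it.
sats = [(0, 0, 0), (4, 0, 0), (1, 5, 0)]
p = (1.0, 2.0, 2.0)
r = [math.dist(s, p) for s in sats]
print(trilaterate(r[0], r[1], r[2], d=4, i=1, j=5))  # (1.0, 2.0, 2.0)
```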
Differential GPS can be used to improve regular GPS using information from land-based transmitters with exact known locations.
Vision
Robot vision is the automatic understanding of images and video.
The sensing process consists of first transducing (using cameras/sensors to detect the information), then measurement and modelling, leading finally to perception.
Often, image data is reduced in resolution (either by deleting pixels or reducing colour data) before processing. In some cases, binary images are used.
Structure from Stereo Vision
Given two image sensors, distance $b$ apart, with a focal length $f$, and target point $(x, y, z)$:
- If the image coordinates of the target point in the left and right images are $u_l$ and $u_r$, then the disparity is $u_l - u_r$.
- By similarity of triangles, $\frac{f}{z} = \frac{u_l}{x} = \frac{-u_r}{b-x}$.
- Therefore $z = b\frac{f}{u_l - u_r}$.
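The disparity-to-depth formula can be sketched in a few lines (the baseline, focal length, and pixel coordinates below are hypothetical):

```python
def stereo_depth(b, f, u_l, u_r):
    """Depth from disparity via z = b*f/(u_l - u_r); units follow b."""
    disparity = u_l - u_r
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return b * f / disparity

# Hypothetical rig: 0.1 m baseline, focal length 500 (in pixel units);
# a disparity of 25 pixels corresponds to a depth of 2 m.
print(stereo_depth(0.1, 500, 300, 275))  # 2.0
```

Note the inverse relationship: halving the disparity doubles the estimated depth, which is why stereo depth estimates degrade quickly for distant points.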
2D Filtering
Filtering of an image using a 2D mask involves sliding the mask over the image and computing weighted sums to replace the centre pixel’s intensity value.
Smoothing
- Used for blurring and noise reduction.
- Each pixel replaced by an average of the intensity over some pixel neighbourhood.
- Gaussian filters are similar and use numerical patterns approximating a 2D gaussian from the centre pixel.
Line Detection
- Use positive values surrounded by negative values in the direction of the lines you want to detect.
- For example, horizontal lines could be detected using $\begin{pmatrix}-1 & -1 & -1 \\ 2 & 2 & 2 \\ -1 & -1 & -1\end{pmatrix}$.
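The sliding-window filtering described above can be sketched as follows, applying the horizontal line mask to a toy image (a naive valid-region implementation, not an optimised one):

```python
import numpy as np

def filter2d(image, mask):
    """Correlate a 2D mask over the image (valid region only): each output
    pixel is the weighted sum of the pixels under the mask."""
    mh, mw = mask.shape
    ih, iw = image.shape
    out = np.zeros((ih - mh + 1, iw - mw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + mh, c:c + mw] * mask)
    return out

# A bright horizontal line on a dark background...
img = np.zeros((5, 5))
img[2, :] = 1.0
line_mask = np.array([[-1, -1, -1],
                      [ 2,  2,  2],
                      [-1, -1, -1]])
# ...produces a strong positive response where the mask straddles the line
# and negative responses on either side.
print(filter2d(img, line_mask))
```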
Edge Features
- Edges typically occur on the boundary between two different regions in an image.
- ‘Line drawings’ of a scene are often useful to higher-level computer vision algorithms.
- The gradient at edges can also be calculated - there are two common methods for this - Prewitt and Sobel.
- The horizontal Prewitt matrix is $\begin{pmatrix}-1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1\end{pmatrix}$. The vertical matrix is this rotated anticlockwise 90 degrees.
- The horizontal Sobel matrix is $\begin{pmatrix}-1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1\end{pmatrix}$. The vertical matrix is this rotated anticlockwise 90 degrees.
- Diagonals can also be detected by rotating the matrices.
Canny Edge Detector
Involves:
- Smoothing the image with Gaussian filter.
- Computing the gradient magnitude of the image.
- Apply non-maximal suppression to reduce spurious responses.
- Check if a pixel is a local maximum along the gradient direction, and select the single maximum across the width of an edge.
- This may result in thin edges being broken.
- Track edges by hysteresis.
- If the gradient is greater than some high threshold, keep the edge.
- If between some low threshold and the high threshold, these are weak responses and may be kept.
- Responses below the low threshold are treated as noise and suppressed.
Thick edges can be avoided by considering the second derivatives at points - edges indicated by zero crossings. This is usually noisier, but produces more localised results.
The second order derivative can be approximated using a discrete Laplacian matrix - e.g. $\begin{pmatrix}-1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1\end{pmatrix}$.
Interest Point Detection
For the ideal feature detector, we desire:
- Accuracy
- Detect all (or most) true interest points.
- No false interest points.
- Precision
- Well localised.
- Repeatability across different images.
- Robustness to noise.
- Efficiency in detection.
Corner Detection
There are two main approaches for this:
- Based on Brightness
- Usually the derivatives of the image.
- This is the type covered in the module.
- Based on Boundary Extraction
- First detect edges, then analyse curvature etc.
With corners, considering a small ‘window’, shifting in any direction should result in a large change in appearance in the window. We could define this ‘cornerness’ as:

$$E(u, v) = \sum_{x, y} w(x, y)\left[I(x + u, y + v) - I(x, y)\right]^2$$
The Harris Corner Detector works by looking at the eigenvalues of a matrix of image gradients. If both eigenvalues are large, the region is classified as a corner. Non-maximal suppression can then be applied to these regions to identify single point corners.
The Harris Corner Detector is rotation invariant, partially intensity invariant, but not scale invariant.
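A rough sketch of the Harris response using the common $R = \det(M) - k\,\mathrm{trace}(M)^2$ formulation of the eigenvalue test (the $3 \times 3$ window, $k$ value, and test image are assumptions, and no smoothing or non-maximal suppression is applied):

```python
import numpy as np

def harris_response(image, k=0.05):
    """Per-pixel Harris response with the structure tensor M summed over
    a 3x3 window; gradients via central differences (np.gradient)."""
    Iy, Ix = np.gradient(image.astype(float))   # row, column derivatives
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy
    R = np.zeros_like(image, dtype=float)
    for r in range(1, image.shape[0] - 1):
        for c in range(1, image.shape[1] - 1):
            Sxx = Ixx[r - 1:r + 2, c - 1:c + 2].sum()
            Syy = Iyy[r - 1:r + 2, c - 1:c + 2].sum()
            Sxy = Ixy[r - 1:r + 2, c - 1:c + 2].sum()
            det = Sxx * Syy - Sxy * Sxy
            R[r, c] = det - k * (Sxx + Syy) ** 2
    return R

# A white square on black: corners respond more strongly than edges,
# and flat regions respond not at all.
img = np.zeros((10, 10))
img[3:7, 3:7] = 1.0
R = harris_response(img)
print(round(R[3, 3], 4), round(R[3, 5], 4))
```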
SIFT
Scale Invariant Feature Transform. Operates by transforming data into scale-invariant coordinates.
SIFT has a number of advantages:
- Locality - Features are local, so robust to occlusion and clutter.
- Distinctiveness - Individual features can be matched to a large database of objects.
- Quantity - Many features can be generated for even small objects.
- Efficiency - Close to real-time performance.
- Extensibility - Can easily be extended to a wide range of differing feature types, each adding robustness.
Briefly, it operates by:
- Generate ‘scale space’.
- Find extrema in the scale space - these are potential locations for finding features.
- Find the location of key points - accurately locating the feature ‘key’.
- Assign an orientation to each key point.
- Describe the key point using the SIFT descriptor.
The scale space is obtained by convolving the image with gaussians of increasing scale. Differences of adjacent gaussian-blurred images (the Difference of Gaussians, which approximates the Laplacian) are then computed.
Each pixel is compared with 26 pixels in the current and adjacent scales of the image, and is selected if it is larger or smaller than all of the other 26 pixels. The set of points is then refined by removing points with low contrast, and removing points on edges (using a similar approach to the Harris corner detector).
Orientation is assigned using a histogram of gradient directions in a neighbourhood around the key point.
The SIFT Descriptor is then obtained by extracting the 16x16 pixel region around the key point, breaking it into 16 smaller 4x4 windows, and then within each of these computing the gradient magnitudes and orientations, placing these into an 8-bin orientation histogram (giving a $16 \times 8 = 128$ dimensional vector). Finally, rotate the orientations relative to that of the key point, normalise to unit length, and threshold the feature vector.
Summary
- Construct Scale Space
- Take Difference of Gaussians
- Locate DoG Extrema
- Sub Pixel Locate Potential Feature Points
- Filter Edge and Low Contrast Responses
- Assign Key Points Orientations
- Build Key Point Descriptors
- Use the features elsewhere…