Digital Forensics
Contents:
- Image Acquisition
- Image Enhancement
- Digital Watermarking
- JPEG Compression
- Compression-Based Forensics
- Copy-Move Forgery
- Sensor Forensics
- Video Forensics
Image Acquisition
Gamma Correction
Camera sensors report brightness linearly with actual luminance, whereas our eyes respond nonlinearly. Gamma correction is used to account for this - it translates actual (linear) luminance into perceived luminance.
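A minimal numpy sketch of gamma encoding, assuming luminance normalised to [0, 1] (real standards such as sRGB also add a small linear segment near zero):

```python
import numpy as np

def gamma_encode(linear, gamma=2.2):
    """Map linear sensor luminance to perceptually spaced values
    using a simple power law."""
    return np.clip(linear, 0, 1) ** (1 / gamma)
```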
Imaging Pipeline
- World
- Lens
- CFA
- Imaging Sensor
- Post Processing
- Digital Image
- Compression
Colour Models
RGB
- Essentially three-dimensional space with axes representing R, G, and B.
- Used for image acquisition and display.
- Usually normalised to have values between 0 and 1.
- Device-dependent - Different devices capture and reproduce given values slightly differently, therefore an RGB value does not necessarily define the same colour across devices.
Y’UV
- Often used for image and video compression and storage.
- Has three components or channels:
- Luma Component (Y’) - Represents the brightness information (intensity with no colour values).
- Chroma Components (U, V) - Colour information encoded into these channels.
- Separate channels for colour and intensity are useful because human eyes are more sensitive to intensity changes than colour changes. Can compress colour information more aggressively without perceptible changes in image quality.
- Usually use a variant called Y’CbCr for images. Transformation from RGB can be made by:
- RGB to Y’UV:
- $Y’ = 0.299R + 0.587G + 0.114B$
- $U = B - Y’$
- $V = R - Y’$
- Scale U and V:
- $Pb = (0.5 / (1 - 0.114)) \times (B - Y’)$
- $Pr = (0.5 / (1 - 0.299)) \times (R - Y’)$
- Scale Y’PbPr to Y’CbCr:
- $Y’ = 16 + 219 \times Y’$
- $Cb = 128 + 224 \times Pb$
- $Cr = 128 + 224 \times Pr$
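A small numpy sketch of the RGB to Y'CbCr conversion above, assuming RGB values normalised to [0, 1]:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an RGB image (floats in [0, 1]) to 8-bit-range Y'CbCr
    using the equations above."""
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    Y = 0.299 * R + 0.587 * G + 0.114 * B        # luma
    Pb = (0.5 / (1 - 0.114)) * (B - Y)           # scaled blue-difference chroma
    Pr = (0.5 / (1 - 0.299)) * (R - Y)           # scaled red-difference chroma
    # Offset and scale Y'PbPr to Y'CbCr.
    Yc = 16 + 219 * Y
    Cb = 128 + 224 * Pb
    Cr = 128 + 224 * Pr
    return np.stack([Yc, Cb, Cr], axis=-1)
```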
Chroma Subsampling
- Human eye is less sensitive to colour than luminance or intensity.
- At normal viewing distance, there is no perceptible loss incurred by reducing the colour detail.
- General concept is subsample by a factor of $n$ - keep every $n$th element starting from the first.
- Subsampling scheme is usually represented as a three part ratio, A:b:c.
- A - Width of the region in which subsampling is performed (usually 4).
- b - Number of Cb/Cr samples in each row of A pixels.
- c - Number of changes in Cb/Cr samples between the first and second row.
- Different techniques can be used, the most common being:
- Average - Subsampled chroma component is the average of the 2x2 original chroma block.
- Left - Subsampled chroma component is the average of the two leftmost chroma pixels of the block.
- Right - Subsampled chroma component is the average of the two rightmost chroma pixels of the block.
- Direct - Subsampled chroma component is the top left chroma pixel.
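A numpy sketch of 4:2:0 subsampling with the 'average' and 'direct' techniques, assuming a 2D chroma channel (cropped to even dimensions for simplicity):

```python
import numpy as np

def subsample_420_average(chroma):
    """Each output sample is the mean of a 2x2 block of the original chroma."""
    h, w = chroma.shape
    c = chroma[:h - h % 2, :w - w % 2].astype(float)
    return (c[0::2, 0::2] + c[0::2, 1::2] + c[1::2, 0::2] + c[1::2, 1::2]) / 4.0

def subsample_420_direct(chroma):
    """Keep only the top-left pixel of each 2x2 block."""
    return chroma[0::2, 0::2]
```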
Comparing Images
- Often essential to compare two images to know how similar or dissimilar they are.
- Visual comparisons are not enough for forensic applications.
Mean Squared Error (MSE)
For two images $X$ and $Y$, MSE is defined as follows:
$MSE(X, Y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2$
Where $x_i$ and $y_i$ are the pixels of $X$ and $Y$, and $N$ indicates the total number of pixels.
Both images must be of the same size to be able to compute the MSE.
Correlation Coefficient
Most popular method is Pearson’s $r$ - the linear correlation between two variables. It can be computed as:
$r = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2 \sum_{i=1}^{N} (y_i - \bar{y})^2}}$
Often $r^2$ is used instead of $r$.
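Both metrics are straightforward in numpy; a sketch:

```python
import numpy as np

def mse(x, y):
    """Mean squared error between two equally sized images."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.mean((x - y) ** 2)

def pearson_r(x, y):
    """Pearson's linear correlation coefficient between two images."""
    x = np.asarray(x, float).ravel()
    y = np.asarray(y, float).ravel()
    xm, ym = x - x.mean(), y - y.mean()
    return np.sum(xm * ym) / np.sqrt(np.sum(xm ** 2) * np.sum(ym ** 2))
```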
Structural Similarity (SSIM)
- Widely used in the imaging industry for benchmarking device performance.
- Suitable for images of the same dimension, when one image is a distorted version of the other.
- Separates structural and non-structural information in images, measures both types of differences between two images or image regions.
- Compares two images over local regions rather than only globally.
- Luminance Comparison - Local luminance is modelled by mean intensity of the local region.
- $l(\mathbf{x}, \mathbf{y}) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}$.
- Contrast Comparison - Local contrast is modelled by standard deviation of intensity of the local region.
- $c(\mathbf{x}, \mathbf{y}) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}$.
- Structural Comparison - Remove non-structural distortion by normalisation.
- $\sigma_{xy} = \frac{1}{N-1} \sum_{i=1}^N (x_i - \mu_x)(y_i - \mu_y)$.
- $s(\mathbf{x}, \mathbf{y}) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$.
- These are then combined for $SSIM(\mathbf{x}, \mathbf{y}) = l(\mathbf{x}, \mathbf{y})^\alpha \times c(\mathbf{x}, \mathbf{y})^\beta \times s(\mathbf{x}, \mathbf{y})^\gamma$.
- Values can be pooled either by using the mean, or weighting according to distortion.
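A single-window SSIM sketch in numpy, assuming 8-bit images, the common constants $C_1 = (0.01L)^2$, $C_2 = (0.03L)^2$, $C_3 = C_2/2$, and $\alpha = \beta = \gamma = 1$. Practical implementations compute this over sliding local windows and pool the values:

```python
import numpy as np

def ssim_window(x, y, L=255, k1=0.01, k2=0.03):
    """SSIM of two same-sized regions with alpha = beta = gamma = 1."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    C1, C2 = (k1 * L) ** 2, (k2 * L) ** 2
    C3 = C2 / 2
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(ddof=1), y.std(ddof=1)
    sxy = np.sum((x - mx) * (y - my)) / (x.size - 1)   # covariance
    l = (2 * mx * my + C1) / (mx ** 2 + my ** 2 + C1)  # luminance term
    c = (2 * sx * sy + C2) / (sx ** 2 + sy ** 2 + C2)  # contrast term
    s = (sxy + C3) / (sx * sy + C3)                    # structure term
    return l * c * s
```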
Image Enhancement
We often need to enhance images for forensic applications. Enhancements include improving illumination, contrast, removing unwanted noise, and sharpening.
Enhancement in the pixel (spatial) domain involves working directly on pixel values.
Spatial Domain
Histograms
Given an image with $L$ grey levels, an image histogram $h(r_k)$ is a discrete function defined as:
$h(r_k) = n_k$
Where $r_k$ denotes the $k$th grey level, $n_k$ denotes the number of pixels with grey level $r_k$, and $k = 0, 1, \cdots, L-1$. A normalised histogram is a discrete function which defines a probability distribution:
$p(r_k) = \frac{n_k}{N}$
Where $N$ is the total number of pixels.
Contrast of an image can be enhanced using Histogram Equalisation. A high contrast image ideally should have a flat histogram spanning the entire range of intensity.
- To perform this on a colour image, can equalise on each RGB channel separately.
This idea can be generalised - it is possible to transform a histogram to any other histogram using cumulative distributions. This is called histogram matching. Outline for implementation is:
- Given two images, compute normalised histograms for both.
- Compute CDF for both.
- Replace each intensity level $x_i$ in input image to $x_j$:
- Find cumulative value for that pixel intensity in input CDF.
- Find corresponding pixel intensity for the same cumulative value in the reference CDF.
- Replace pixel intensity with this corresponding intensity.
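A numpy sketch of this outline for 8-bit greyscale images (histogram equalisation is the special case where the reference CDF is that of a flat histogram):

```python
import numpy as np

def match_histogram(source, reference, levels=256):
    """Map each grey level in `source` (uint8) to the level in `reference`
    with the nearest cumulative probability, following the outline above."""
    src_hist = np.bincount(source.ravel(), minlength=levels) / source.size
    ref_hist = np.bincount(reference.ravel(), minlength=levels) / reference.size
    src_cdf = np.cumsum(src_hist)
    ref_cdf = np.cumsum(ref_hist)
    # For each source level, find the reference level with matching CDF value.
    lut = np.searchsorted(ref_cdf, src_cdf).clip(0, levels - 1).astype(np.uint8)
    return lut[source]
```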
Noise Removal
- Noise is a random quantity, often of unknown type.
- For practical purposes, we make some assumptions about noise:
- It is additive.
- Can be modelled using a known probability distribution function (e.g. Gaussian).
- Noise is independent of pixel location or value.
- One way to reduce noise is through averaging.
- Local Averaging - Replace every pixel value by the mean of its neighbouring pixel values.
- Median Filter - Replace each pixel by the median of all its neighbouring pixel values.
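A sketch of both filters using scipy.ndimage on a synthetic noisy image:

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)
clean = np.zeros((64, 64))
clean[16:48, 16:48] = 1.0                          # simple test image
noisy = clean + rng.normal(0, 0.2, clean.shape)    # additive Gaussian noise

# Local averaging: each pixel becomes the mean of its 3x3 neighbourhood.
mean_filtered = ndimage.uniform_filter(noisy, size=3)

# Median filter: replaces each pixel by its neighbourhood median;
# better at preserving edges and removing impulse (salt-and-pepper) noise.
median_filtered = ndimage.median_filter(noisy, size=3)
```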
Frequency Domain
Enhancement can also be done in the Frequency Domain.
Fourier transform - Any function can be expressed in terms of sinusoids of varying frequency. For images, we can compute a 2D discrete Fourier transform.
Noise Removal in Frequency Domain
- Most noise appears in the high frequency components.
- We can remove or reduce high frequency components, which helps remove noise.
- Will also remove some information, however.
- Some common masks are:
- Ideal Low Pass - Sharp cut off, rectangular shaped.
- Gaussian Low Pass - Smooth cutoff using Gaussian.
- Butterworth Low Pass - Mix between gaussian and rectangular (can control shape).
- Notch Filters can be used to remove repetitive spectral noise.
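A minimal numpy sketch of Gaussian low-pass filtering in the frequency domain (an ideal low-pass would use a hard circular mask instead):

```python
import numpy as np

def gaussian_lowpass(image, sigma):
    """Attenuate high-frequency components with a Gaussian low-pass mask."""
    F = np.fft.fftshift(np.fft.fft2(image))            # spectrum, DC at centre
    h, w = image.shape
    y, x = np.ogrid[-(h // 2):h - h // 2, -(w // 2):w - w // 2]
    mask = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))  # smooth cutoff
    filtered = np.fft.ifft2(np.fft.ifftshift(F * mask))
    return np.real(filtered)
```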
Discrete Cosine Transform (DCT)
DCT is similar to DFT, but DCT uses weighted sums of cosines - and therefore is a real rather than complex transformation.
The (0, 0) component of a DCT is called the DC component, and controls the overall brightness of the image.
DCT is popular for compression, as high frequency components can be removed without visually altering the appearance of the image too much.
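A quick illustration using scipy's `dctn`, with a random 8x8 block standing in for image data:

```python
import numpy as np
from scipy.fft import dctn, idctn

block = np.random.default_rng(0).random((8, 8))   # stand-in 8x8 block
C = dctn(block, norm='ortho')                     # 2D DCT coefficients (real)
print(C[0, 0])   # DC component: proportional to the block's mean brightness
assert np.allclose(idctn(C, norm='ortho'), block) # the transform is invertible
```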
Digital Watermarking
Digital watermarking is a technique which can be used to help determine if an image is authentic. It is an active approach (as it involves inserting a signature pattern into data before it is distributed).
There are several possible application areas:
- Ownership/Copyright Identification
- Can add invisible, robust watermark to an image before making it public.
- Ownership Claim
- Can add a visible watermark to make an ownership claim.
- Unauthorised Copy Detection
- Embed unique watermarks for each person source is distributed to, to identify who leaked it.
- Tampering Detection
- Add a fragile watermark so it is destroyed if tampering attempts are made.
Types of Watermarks
- Blind vs Not Blind
- Blind does not require access to original un-watermarked data to recover the watermark.
- Perceptible vs Imperceptible
- Private vs Public
- Only authorised users can detect private watermark.
- Robust vs Fragile
- Robust - Designed to survive intentional and unintentional modifications.
- Semi-Fragile - Designed for detecting any unauthorised modification, at the same time allowing some basic modifications such as rotation, scaling, cropping.
- Fragile - Used to detect any unauthorised modification.
Bitplane Substitution
- Simple watermarking technique involving changing pixel values in a given image by altering the least significant bit planes.
- The watermark can be smaller than or equal to the size of the image. Multiple watermarks can be inserted if needed.
- Content Independent
- Different image, such as logo can be used.
- Or sequence of small random numbers.
- Content Dependent
- Select pixel locations randomly.
- Watermark formed using the 7 most significant bits of each pixel.
- Different segments concatenated to form a watermark.
Selecting Embedding Locations
- Pixel locations are often chosen randomly.
- To decode a watermark like this a ‘key’ is required.
- Can be embedded in specific bit planes over the entire image (at all pixel locations).
- Can be embedded in specific bit planes of selected image regions.
- Select pixels which are more tolerant to visual changes.
- Based on local properties of image and how human visual system works.
- HVS is less sensitive to changes in the blue channel.
- HVS is less sensitive to the distortions in the edges of an image.
To create a visible watermark, we can substitute the more significant bit planes of the watermark into the LSB plane of the cover.
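A minimal sketch of invisible LSB embedding and blind extraction in numpy, assuming an 8-bit cover and a binary watermark of the same shape:

```python
import numpy as np

def embed_lsb(cover, watermark_bits):
    """Replace the least significant bit plane of `cover` (uint8)
    with a binary watermark of the same shape."""
    return (cover & 0xFE) | (watermark_bits & 1)

def extract_lsb(watermarked):
    """Blind extraction: the watermark is simply the LSB plane."""
    return watermarked & 1
```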
Watermarking by bitplane substitution is:
- Extremely simple and fast.
- Can create visible or invisible watermarks.
- Does not necessarily require the original image to recover the watermark.
- Watermarks are fragile/semi-fragile.
- Simple attacks can destroy the watermark.
- Spreading pixel locations over the entire image makes the watermark more robust to modifications like cropping.
- The entire watermark can be removed by removing the LSB plane.
- Survives compression to some extent.
Additive Watermark
- Simply add values to pixels of the image.
- Decoding requires the original image (to subtract pixel values).
Watermarking in Frequency Domain
- A more sophisticated approach involves changing frequency components of an image obtained using DFT or DCT.
- Embed the watermark into perceptually important regions of the image so that the watermark is hard to remove without degrading the visual quality of the image.
- Known to be more robust to common watermarking attacks.
- Spread Spectrum (SS) watermarking involves spreading the watermark over many frequency bins so that the energy in any one bin is very small and undetectable.
Spread Spectrum Encoding
- Compute the 2D DCT of image.
- Identify the $n$ largest coefficients, excluding the DC component, to construct the vector $h$.
- Create the watermark as a vector $w$ whose components are sampled from a Gaussian distribution.
- Change the $n$ largest coefficients identified in Step 2 by $h_i^* = h_i (1 + \alpha w_i)$.
- Compute the inverse 2D DCT using the modified coefficients to get the watermarked image.
Spread Spectrum Decoding
Requires access to the original un-watermarked image. Recompute $h_i$ from the original, then recover the watermark as:
$w_i = \frac{1}{\alpha}\left(\frac{h_i^*}{h_i} - 1\right)$
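A sketch of both steps using scipy's 2D DCT. For simplicity the coefficient locations `idx` are returned to the decoder directly, whereas in practice they would be re-identified from the original image:

```python
import numpy as np
from scipy.fft import dctn, idctn

def ss_embed(image, n=1000, alpha=0.1, seed=42):
    """Perturb the n largest non-DC DCT coefficients: h* = h(1 + alpha w)."""
    C = dctn(image.astype(float), norm='ortho')
    mags = np.abs(C).ravel()
    mags[0] = -np.inf                          # exclude the DC component
    idx = np.argsort(mags)[-n:]                # n largest coefficient locations
    w = np.random.default_rng(seed).standard_normal(n)   # Gaussian watermark
    C.ravel()[idx] *= (1 + alpha * w)
    return idctn(C, norm='ortho'), idx, w

def ss_extract(watermarked, original, idx, alpha=0.1):
    """Non-blind decoding: w_i = (h_i*/h_i - 1) / alpha."""
    h_star = dctn(watermarked.astype(float), norm='ortho').ravel()[idx]
    h = dctn(original.astype(float), norm='ortho').ravel()[idx]
    return (h_star / h - 1) / alpha
```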
Robustness of Frequency Domain Watermarks
Watermarks in the frequency domain are known to be more robust to common watermarking attacks:
- Cropping - Watermark is embedded in the entire spatial extent of an image.
- Compression - Watermark is inserted in the components which survive compression.
- Removal - To destroy the watermark, high values need to be added to all frequencies, which affects the visual quality and can be easily detected.
- Filtering - Survives low-pass filtering.
Hybrid Watermarking
- Block-based approaches embed watermarks in both spatial and frequency domains.
- Spread spectrum watermarking can be performed in blocks.
One simple technique involves comparing (and, where necessary, swapping) the DCT coefficients at (2, 3) and (4, 1) on a block-by-block basis, as these components have been shown to have roughly equal visual importance.
- Divide image into blocks.
- For each block take 2D DCT.
- Choose two locations of equal perceptual importance (for example (2, 3) and (4, 1)); interpret coefficient (2, 3) > coefficient (4, 1) as encoding 0, and otherwise 1.
- Swap the two coefficient values if the encoded bit is not as desired - visual quality will only be very slightly affected by this.
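A per-block sketch of this swap, assuming 8x8 blocks and an orthonormal DCT (ties, where the two coefficients are equal, would need a small offset in practice):

```python
import numpy as np
from scipy.fft import dctn, idctn

def embed_bit(block, bit):
    """Encode one bit in a block by ordering the DCT coefficients at
    (2, 3) and (4, 1): coeff(2, 3) > coeff(4, 1) encodes 0, otherwise 1."""
    C = dctn(block.astype(float), norm='ortho')
    want_zero = (bit == 0)
    if want_zero != (C[2, 3] > C[4, 1]):
        C[2, 3], C[4, 1] = C[4, 1], C[2, 3]   # swap to flip the encoded bit
    return idctn(C, norm='ortho')

def extract_bit(block):
    C = dctn(block.astype(float), norm='ortho')
    return 0 if C[2, 3] > C[4, 1] else 1
```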
Comparing Watermarks
- In simple cases, comparisons using MSE, correlation, or SSIM are sufficient.
- Cosine Similarity can also be used:
- $\text{sim}(w_1, w_2) = \frac{\langle w_1, w_2 \rangle}{\|w_1\| \, \|w_2\|}$
- If similarity value is above some threshold, it is considered a match.
- If an attacker generates a random normally distributed $w$, it is possible (in very rare cases) for the watermarks to match - a longer watermark reduces this possibility further.
Attacks on Watermarks
- Compression attacks.
- Filtering attacks - e.g. smoothing with low-pass filter.
- Cropping, scaling, rotation.
- Collusion attacks.
- Differently watermarked images from different users are averaged.
- Jitter attack.
- Basic idea is to change the locations of embedded watermark so that it can not be recovered.
- Split the audio/image into number of small chunks.
- Duplicate or Delete data points at random.
- Imperceptible in images, and even in classical music.
JPEG Compression
- Compression is the process of reducing the amount of bits required to store/represent a given information.
- Compression ratio between two representations is $c_r = n_1 / n_2$.
- Data redundancy is defined as $r_d = 1 - \frac{1}{c_r}$.
- Compression algorithms attempt to remove redundancies in images.
- Spatial Redundancy - Pixel values are highly correlated with neighbouring pixels. Correlations also exist on a higher level with repeating patterns and structures.
- Psycho-visual Redundancy - Details in images that our visual system can not see.
- Coding Redundancy - Using more bits per pixel than needed.
JPEG is the most common image compression algorithm. It is a lossy algorithm - every time you compress an image using JPEG, some information loss will occur.
The steps involved in the algorithm are:
- Colour space conversion
- Converted to YCbCr - we want to separate the intensity and colour components so they can be compressed differently.
- Cb and Cr components are subsampled by a factor of 2, though this is skipped when the highest quality is required.
- Division into sub-images
- Image is divided into non-overlapping blocks of size 8x8 or 16x16.
- Encoding is done on each block independently (each of these blocks being called a macro-block or minimum coded unit).
- Helps ensure better compression when DCT is applied without high information loss.
- Discrete Cosine Transform
- Compute DCT for each block.
- Quantiser
- Main lossy step in JPEG. Each coefficient is quantised by a predefined factor.
- Many-to-one mapping, i.e. multiple input coefficients get mapped to a single output value.
- $F_Q(u, v) = \text{round}\left( \frac{F(u, v)}{Q(u, v)} \right)$
- $Q(u, v)$ is the quantisation matrix. Quantisation factors are typically larger in the high frequencies.
- Each channel will be quantised by a predefined quantisation matrix. Values are typically larger for the chroma channels.
- Designed to roughly model the sensitivity of the human visual system.
- Entropy coding
- Huffman coding is used to reduce amount of bits required to represent each block.
- Coefficients within each block are stored in zig-zag order, starting with the low frequency components and ending with the high frequency components.
- This is so the 0 values, which often occur in the high frequency components, are bunched together (which run-length and Huffman coding exploit).
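A sketch of the lossy core of this pipeline on a single 8x8 luminance block, using the standard JPEG luminance quantisation table (quality 50):

```python
import numpy as np
from scipy.fft import dctn, idctn

# Standard JPEG luminance quantisation matrix (quality 50).
Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99]])

def jpeg_block_roundtrip(block):
    """Quantise and de-quantise one 8x8 block; the rounding step is where
    the information loss happens."""
    C = dctn(block.astype(float) - 128, norm='ortho')   # level shift, then DCT
    Cq = np.round(C / Q)                                # lossy quantisation
    return idctn(Cq * Q, norm='ortho') + 128            # de-quantise and invert
```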
Huffman Coding
Should know how to do this by hand - notes will be added later (from the JPEG compression lecture).
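In the meantime, a minimal heap-based sketch of Huffman code construction:

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a Huffman code table for a sequence of symbols.
    Returns {symbol: bitstring}; frequent symbols get shorter codes."""
    freq = Counter(symbols)
    # Heap items: (frequency, tiebreak, tree); a tree is a symbol or a pair.
    heap = [(f, i, s) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)   # two least frequent subtrees...
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (t1, t2)))  # ...are merged
        count += 1
    codes = {}
    def walk(tree, prefix=''):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + '0')
            walk(tree[1], prefix + '1')
        else:
            codes[tree] = prefix or '0'   # single-symbol edge case
    walk(heap[0][2])
    return codes

print(huffman_codes("aaaabbbccd"))  # {'a': '0', 'b': '10', 'd': '110', 'c': '111'}
```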
JPEG Header
A JPEG header contains the following information:
- Image dimensions.
- Quantisation values.
- Huffman codes.
- A thumbnail image (a cropped and filtered version of the original).
- Thumbnail dimension.
- Quantisation values.
- Huffman codes.
- Metadata.
The header can be used as a forensic tool - up to 576 values can be extracted to form a signature of the image. Any manipulation will alter this signature, and so can be detected. Signatures can be compared to those stored for known authentic cameras.
Compression-Based Forensics
Double Compression
- In JPEG, double compression means double quantisation has taken place.
- Quantisation changes DCT coefficients. Double quantisation artefacts will be visible in the distribution of DCT coefficients.
- Quantisation uses the floor or rounding functions - these are not invertible.
- Compression by two different factors can lead to empty bins, or periodic peaks in bins, as can be seen on a histogram.
The number of original histogram bins $n(v)$ contributing to bin $v$ in the double quantised histogram is given by:
$n(v) = b \left( \left\lceil \frac{a}{b}(v+1) \right\rceil - \left\lceil \frac{a}{b} v \right\rceil \right)$
Where $b$ is the first quantisation factor, and $a$ is the second quantisation factor. This is a periodic function, with period $b$.
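A sketch that evaluates $n(v)$ and shows the periodic empty bins, here with $b = 3$ and $a = 2$ (so $a/b$ is not an integer):

```python
import numpy as np

def n_contributing(v, b, a):
    """Number of original bins mapped to bin v after quantising by b then a."""
    return b * (np.ceil(a * (v + 1) / b) - np.ceil(a * v / b))

# Periodic pattern with period b = 3, including empty bins:
print([int(n_contributing(v, b=3, a=2)) for v in range(9)])
# -> [3, 3, 0, 3, 3, 0, 3, 3, 0]
```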
Limitations
- In some cases the histogram of a double-quantised signal does not contain periodic artefacts. This happens if $a/b$ is an integer or if $a=b$.
- The histogram may contain naturally occurring artefacts, which could hide those introduced by double compression.
- If the image is not stored in a lossless format, double compression is almost impossible to detect.
JPEG Ghosts
- Compression-based forensic technique similar to the double compression - relies on double quantisation.
- Particularly useful for detecting splicing forgery.
- Can even localise regions that have possibly come from a different image.
Quantisation Error
- Consider a set of coefficients $c_1$ quantised by $q_1$ and then de-quantised to produce $c_1^*.$
- Suppose $c_1^*$ was then quantised a second time by $q_2$ to yield coefficients $c_2$. De-quantising $c_2$ produces $c_2^*$.
- The difference between $c_1^*$ and $c_2^*$ increases as $q_2$ increases - quantised coefficients become increasingly sparse, so the SSD will increase.
- If an image is recompressed using the same quality as it was originally compressed at, the SSD between the original and compressed versions will be 0.
Therefore, periodically recompressing an image and looking for drops in the SSD between original and compressed versions will identify the quality levels the image (or part of the image) has been previously compressed at. By examining the SSDs across the image, it is possible to identify the region which may have been compressed multiple times.
Note that the physical properties of regions in an image could affect the forensic analysis with this technique. Differences can be spatially averaged and normalised to help identify the specific regions affected.
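A sketch of the recompression loop using Pillow, computing a single global SSD per quality; per-region analysis would instead compute spatially averaged, normalised differences over blocks:

```python
import io
import numpy as np
from PIL import Image

def jpeg_ghost_curve(image, qualities=range(20, 100, 5)):
    """Recompress a PIL image at a range of qualities; a dip in SSD at some
    quality suggests prior compression at (roughly) that quality."""
    grey = image.convert('L')
    orig = np.asarray(grey, dtype=float)
    ssds = []
    for q in qualities:
        buf = io.BytesIO()
        grey.save(buf, format='JPEG', quality=q)   # recompress in memory
        buf.seek(0)
        recompressed = np.asarray(Image.open(buf), dtype=float)
        ssds.append(np.sum((orig - recompressed) ** 2))
    return list(qualities), ssds
```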
Copy-Move Forgery
- Copy-move forgery is where a region of an image may have been copied to another part of the image in order to alter its appearance.
- Can be detected by matching image blocks to spot those which are identical (or nearly identical if lossy compression has been used).
- Naively, this is very computationally expensive.
Circular Shift & Match
Circular shift and match is an algorithm to more efficiently detect cloned regions in an image.
- Initialise a binary image A with all 0s.
- Initialise $k, l$ to 0.
- For each shift $(k, l)$, up to $(k_{max}, l_{max})$:
- Compute S from input image I by using a circular shift of $(k, l)$.
- Compute binary image $D = |I-S| < t$ for some threshold $t$.
- Erode and dilate $D$ using a structure of size $b\times b$ for some $b$ to create $D_{ed}$.
- Update $A$ as $A = A + D_{ed}$.
- Increment $k, l$.
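A numpy/scipy sketch of the algorithm, with illustrative defaults for the shift range, threshold $t$, and structuring element size $b$:

```python
import numpy as np
from scipy import ndimage

def circular_shift_match(I, k_max=32, l_max=32, t=5, b=5):
    """Accumulate evidence of cloned regions: pixels that match under some
    circular shift (k, l) survive erosion/dilation and are added into A."""
    I = I.astype(float)
    A = np.zeros_like(I)
    structure = np.ones((b, b), dtype=bool)
    for k in range(1, k_max):
        for l in range(1, l_max):
            S = np.roll(I, shift=(k, l), axis=(0, 1))   # circular shift
            D = np.abs(I - S) < t                       # near-identical pixels
            D_ed = ndimage.binary_dilation(
                ndimage.binary_erosion(D, structure), structure)
            A += D_ed                                   # update accumulator
    return A
```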
Feature Matching
- So far, we have been comparing images based on pixel values to find similar regions.
- More useful to find similar segments by performing feature matching to detect similar cloned regions.
- A feature vector is a signature of an image region that captures its important characteristics.
- Basic idea is that instead of comparing blocks by MSE or SSIM, we can compare their features.
- Dense Features operate on all regions in an image, no point selection method is needed.
- Key-point Based Features identify regions in an image that are distinct and extract features only from those regions.
Features
- Features can be global or local, depending on whether they capture properties of the entire image or a smaller part of it.
- Can capture one or more of the properties of an image region.
- Shape,
- Colour,
- Texture/Patterns,
- Motion,
- Deformation of an object of interest.
Properties of a good feature are:
- Compact representation.
- Robust against noise.
- Robust against geometric distortion.
- Robust against photometric distortions.
Local Binary Pattern
- Popular feature which can encode texture information of an image region.
- To compute it:
- Consider a neighbourhood of a pixel (usually 3x3).
- Threshold the neighbourhood pixels using the centre pixel value.
- Create LBP code from these thresholded values.
- This can withstand some noise, but is not rotation invariant.
- …but we can make a rotation invariant LBP by simply choosing the minimum binary value of all the possible rotations for the feature vector.
We can then compute a LBP histogram. This can be global or for smaller regions.
- One approach is for each 32x32 image region…
- Divide into 4 16x16 image regions.
- Create 256-d histogram for each sub-block.
- Concatenate the 4 histograms together.
Codes can be further ‘compressed’ by considering Uniform LBPs, which are those with at most 2 bitwise transitions (0→1 or 1→0) in the circular binary pattern. Most image pixels get encoded into a uniform LBP.
- We can use 58 histogram bins, one for each uniform LBP code, and a single bin for all non-uniform codes.
- This reduces the dimensionality of the histogram to 59.
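A basic (non-uniform, non-rotation-invariant) LBP sketch in numpy, with a fixed clockwise neighbour ordering as an illustrative choice:

```python
import numpy as np

def lbp_image(img):
    """3x3 LBP: threshold the 8 neighbours of each pixel against the centre
    and pack the results into an 8-bit code."""
    img = img.astype(float)
    c = img[1:-1, 1:-1]                                 # centre pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),      # clockwise from top-left
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code |= ((nb >= c).astype(np.uint8) << bit)     # one bit per neighbour
    return code

def lbp_histogram(img, bins=256):
    h, _ = np.histogram(lbp_image(img), bins=bins, range=(0, bins))
    return h / h.sum()                                  # normalised texture feature
```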
Histogram of Oriented Gradients
- One of the most popular features in image analysis and computer vision.
- Can approximate gradients at pixels using kernel convolution.
- With gradients in $x$ and $y$ directions computed, can also compute their angles.
- Can then construct a histogram with bins as angles, and weighted by pixel gradient magnitudes.
- As before, can consider a 32x32 block split into 16x16 regions.
- Could use a 9-dimensional histogram for each block, then concatenate blocks.
Harris Corner Detector
This is a popular key-point based feature detector.
The basic idea is we are looking for regions which change in both the horizontal and vertical directions - these are likely to be corners. Look at the error between two patches shifted by $(u,v)$:
$E(u, v) = \sum_{x, y} w(x, y) \left[ I(x+u, y+v) - I(x, y) \right]^2$
Edges and corners will yield a large $E(u,v)$. Using a first-order Taylor series expansion, this can be written as $E(u, v) \approx \begin{pmatrix} u & v \end{pmatrix} M \begin{pmatrix} u \\ v \end{pmatrix}$, where $M = \sum_{x,y} w(x, y) \begin{pmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{pmatrix}$ (the same matrix as in Robotics), and the corner response $R$ is defined as $R = \text{det}(M) - k(\text{trace}(M))^2$, where $k$ is some constant.
For each pixel, the image gradient can be approximated using kernel convolution.
We can then threshold by some $t$ to identify features only with large enough $R$ values.
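A sketch of the response computation, with Gaussian smoothing standing in for the window function $w$ and a hypothetical $k = 0.05$:

```python
import numpy as np
from scipy import ndimage

def harris_response(img, k=0.05, sigma=1.0):
    """Corner response R = det(M) - k * trace(M)^2, with the entries of M
    averaged over a local (here Gaussian) window."""
    img = img.astype(float)
    Iy, Ix = np.gradient(img)                    # approximate image gradients
    Ixx = ndimage.gaussian_filter(Ix * Ix, sigma)
    Iyy = ndimage.gaussian_filter(Iy * Iy, sigma)
    Ixy = ndimage.gaussian_filter(Ix * Iy, sigma)
    det_M = Ixx * Iyy - Ixy ** 2
    trace_M = Ixx + Iyy
    return det_M - k * trace_M ** 2

# Keep only strong corners by thresholding the response map:
# corners = harris_response(img) > t
```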
Feature Matching
- Given $N$ features extracted from an image, we wish to find the most similar patches.
- Can be done by brute-force, computing distances between some $f$ and the remaining $N-1$ features.
- The ‘distance’ between features can be calculated in several different ways:
- L2 Distance: $||f_1 - f_2||_2 = \sqrt{\sum_i (f_{1_i} - f_{2_i})^2}$.
- L1 Distance: $||f_1 - f_2||_1 = \sum_i |f_{1_i} - f_{2_i}|$.
- Sum of Squared Distances (SSD): $SSD(f_1, f_2) = \sum_i (f_{1_i} - f_{2_i})^2$.
- Mean Squared Error: $MSE(f_1, f_2) = \frac{1}{N} \sum_{i=1}^N (f_{1_i} - f_{2_i})^2$.
- Inner Product: $\left< f_1, f_2 \right> = \sum_i f_{1_i} f_{2_i}$.
- Weak matches are discarded, as are matches which fail the ratio test: compute the ratio between the SSDs of the best and second-best matches - ambiguous (bad) matches will have a ratio close to 1.
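A brute-force matching sketch using SSD and the ratio test, assuming feature matrices as numpy arrays (one feature per row); the 0.8 threshold is an illustrative choice:

```python
import numpy as np

def match_features(feats_a, feats_b, ratio=0.8):
    """For each feature in A, keep the best match in B only if it is
    clearly better (by SSD) than the second-best match."""
    matches = []
    for i, f in enumerate(feats_a):
        ssd = np.sum((feats_b - f) ** 2, axis=1)   # SSD to every feature in B
        best, second = np.argsort(ssd)[:2]
        if ssd[best] < ratio * ssd[second]:        # ambiguous matches have ratio ~1
            matches.append((i, best))
    return matches
```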
Sensor Forensics
- Imperfections and noise characteristics of imaging sensors allow forensic experts to match an image to its source device.
- These noise components are intrinsic to the image acquisition process and therefore cannot be avoided.
Sensor Noise can be split into two broad categories: Shot Noise and Sensor Pattern Noise.
- Shot Noise is random.
- Roughly follows a Poisson distribution.
- Sensor Pattern Noise is deterministic.
- Stays approximately the same if multiple images are taken with the same camera.
- Can be further broken up into different components:
- Fixed Pattern Noise
- Arises due to variation in sensitivity when sensor is exposed to light.
- Is additive.
- Can easily be suppressed by subtracting a dark image (taken with the same sensor) from a given image.
- Many cameras do this by default.
- Photoresponse Non-Uniformity
- Dominant part of SPN.
- Pixel Non-Uniformity
- Primary component of PRNU.
- Arises due to inhomogeneity of silicon wafers and imperfections during manufacturing.
- Not affected by any external conditions.
- PNU pattern of a source is unique, and therefore can be used for source identification.
- Low Frequency Defects
- May arise due to light refraction on dust particles, optical characteristics.
- Not a characteristic of the sensor itself, so cannot be used for source identification.
We want to use PRNU for identification.
We can model the image acquisition process as:
$y_{ij} = f_{ij}(x_{ij} + \eta_{ij}) + c_{ij} + \varepsilon_{ij}$
Where $x_{ij}$ is the incident light intensity, $f_{ij}$ is the PRNU factor (close to 1), $\eta_{ij}$ is shot noise, $c_{ij}$ is the additive FPN, and $\varepsilon_{ij}$ is random noise.
Ignoring shot and random noise, we can therefore write the PRNU-corrected sensor output, $\hat{x}_{ij}$, as:
$\hat{x}_{ij} = \frac{y_{ij}}{\hat{f}_{ij}}$
$\hat{f}_{ij}$ is an approximation of the PRNU factor, and can be obtained by averaging (before any other processing) multiple images of a uniformly lit scene taken by the target sensor.
Approximating Reference SPN
- Ignore FPN.
- For a series of images taken by the sensor…
- Suppress scene content by de-noising (possibly using averaging).
- Subtract de-noised version from original image.
- Average noise residuals.
Comparing Reference and Test SPN
- Can use similar approach to that in watermarking.
- Can compute the similarity or correlation between reference SPN and SPN approximation of given image.
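A sketch of the whole pipeline, with Gaussian smoothing standing in for a proper de-noising filter (real systems typically use a wavelet-based de-noiser):

```python
import numpy as np
from scipy import ndimage

def noise_residual(img, sigma=2.0):
    """Approximate the sensor noise: subtract a de-noised version
    of the image from the original."""
    img = img.astype(float)
    return img - ndimage.gaussian_filter(img, sigma)

def reference_spn(images):
    """Average the noise residuals of many images from the same camera."""
    return np.mean([noise_residual(im) for im in images], axis=0)

def spn_correlation(reference, test_img):
    """Normalised correlation between reference SPN and a test residual."""
    r = reference.ravel()
    t = noise_residual(test_img).ravel()
    r, t = r - r.mean(), t - t.mean()
    return np.sum(r * t) / np.sqrt(np.sum(r ** 2) * np.sum(t ** 2))
```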
Video Forensics
- Videos are just sequences of images. Therefore, many of our existing techniques are also applicable to videos.
- There are some forgeries which are unique to video, however.
Video Encoding
- Individual frames can be encoded just like an image.
- However, clearly in a lot of cases there is large redundancy in the temporal direction (between frames).
- Removing these redundancies is how we efficiently encode video.
We assign some different types of frames:
- Intracoded (I) Frames - Also sometimes called keyframes, these are encoded independently, like a still image, using a technique similar to JPEG.
- Predicted (P) Frames and Bi-directional Predicted (B) Frames
- Use motion estimation and compensation techniques to estimate how the frames change in time.
- We predict these frames using other frames.
P-Frame Prediction
- Consider an 8x8 block in the P-Frame to be encoded, at location $(i,j)$.
- We look for the matching block in the previous reference frame.
- Since the changes between two frames are likely to be small, searching only a small window will likely suffice.
- MSE is used to find the best matching block.
- Displacement of the block is stored in a motion vector.
- A predicted frame is then assembled using the motion vectors, and an error frame is computed as the difference between the target frame and the predicted frame.
- Error frame is encoded using Huffman coding, similar to JPEG.
- Note the details here are light as precisely how this process works in detail isn’t a concern for us.
- P-Frames are predicted from previous I and P frames, while B-Frames are predicted from both previous and future I and P frames.
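An exhaustive block-matching sketch in numpy, with illustrative 8x8 blocks and a small search window:

```python
import numpy as np

def best_match(ref, block, i, j, search=8):
    """Find the motion vector for an 8x8 block at (i, j) of the target frame
    by exhaustive MSE search in a small window of the reference frame."""
    h, w = ref.shape
    best_mse, best_mv = np.inf, (0, 0)
    for di in range(-search, search + 1):
        for dj in range(-search, search + 1):
            y, x = i + di, j + dj
            if 0 <= y and y + 8 <= h and 0 <= x and x + 8 <= w:
                cand = ref[y:y + 8, x:x + 8]
                m = np.mean((cand - block) ** 2)     # MSE of candidate block
                if m < best_mse:
                    best_mse, best_mv = m, (di, dj)
    return best_mv, best_mse
```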
Each colour channel is encoded independently (as with images), and each sequence is divided into Group of Pictures (GOPs). An example arrangement could be: [I B B P B B P].
Frames are typically compressed in a periodic sequence of GOP, and GOPs can be parameterised by the number of frames, $N$, and the spacing of the P-frames, $M$.
When decoding:
- I-Frames are decoded as independent images.
- P and B Frames are decoded depending on reference frames.
- Reference frame + Motion vectors + residual frame.
Frame Deletion Forgery
- Recall each sequence is divided into Group-of-Pictures.
- If an attacker deletes frames, the remaining frames must be re-ordered and re-encoded using the same GOP structure.
- This can lead to frames which were previously I-Frames being re-encoded as P-Frames etc.
- For all GOPs after the deleted frame, this leads to larger prediction error, as it will be compounded following the re-encoding.
- The increase in prediction error is periodic, as it appears in each GOP following frame deletions.
This pattern of motion error is easy to detect as the motion errors are explicitly encoded. We can extract the residual frames from the compressed video stream and compute the mean prediction error for the entire frame. Periodic spikes in prediction errors indicate tampering.
To spot this, we can use a Fourier transform of the mean prediction error and look for peaks arising from the periodic variations.
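A sketch of this analysis, given a sequence of per-frame mean prediction errors extracted from the compressed stream:

```python
import numpy as np

def deletion_peaks(mean_errors):
    """Inspect the magnitude spectrum of the per-frame mean prediction
    error sequence; strong periodic peaks suggest frame deletion."""
    e = np.asarray(mean_errors, dtype=float)
    e = e - e.mean()                         # remove the DC component
    spectrum = np.abs(np.fft.rfft(e))
    freqs = np.fft.rfftfreq(len(e))          # in cycles per frame
    return freqs, spectrum
```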
Re-Projection Forgery
- Involves detecting if a video has been recorded from a screen (e.g. counterfeit films recorded from a cinema).
- If a video has been re-projected, it could be modelled as a transformation with two different camera matrices.
- The important thing to look at is the skew factor (first row, second column).
- If the video is a reproduction, it can be shown this skew will often be non-zero.
- We can use a camera model and planar Homography to attempt to calculate the skew, with a large value indicating forgery.
There is much more detail of the maths in the slides, but this appears to be the general overview of the technique.