Alexander Logan

Foundations of Data Analytics

Contents:

  1. Data Basics
  2. Regression
  3. Classification
  4. Clustering
  5. Recommendation

Data Basics

Data Types

Each attribute can be viewed as describing a random variable. Distributions estimated from data are typically discrete, even when the underlying variable is continuous.

Skewness

Skewness measures the asymmetry of a distribution about its mean: positive skew means a longer right tail, negative skew a longer left tail.

Distributions

See slide 16 onwards in data basics.

Q-Q Plots

Compare the quantiles of one distribution against those of another: if the points lie close to the line $y = x$, the two distributions are similar.

Covariance

Calculated as:

$\mathrm{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$

If $X$ and $Y$ are independent, then their covariance is 0. The converse does not hold: zero covariance does not imply independence. For example, if $X$ is symmetric about 0 and $Y = X^2$, then $\mathrm{Cov}(X, Y) = 0$ even though $Y$ is a function of $X$.

PMCC

PMCC is the Pearson product-moment correlation coefficient: covariance normalised by the standard deviations,

$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$

so that $-1 \le r \le 1$. See slide 26 onwards in data basics.
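A minimal numpy sketch computing both quantities (sample covariance and Pearson's $r$) on made-up data; `np.cov` and `np.corrcoef` are the library equivalents:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Covariance: mean of the products of deviations from the means.
cov = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson's r: covariance normalised by the standard deviations.
r = cov / (x.std() * y.std())

print(cov, r)
print(np.cov(x, y, bias=True))  # library equivalent (population normalisation)
print(np.corrcoef(x, y))        # library equivalent for r
```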

Correlation for Ordered Data

Ordered (ordinal) data can be compared by correlating ranks rather than raw values (rank correlation). For correlation for categoric data, see slide 31 onwards in data basics.

Distance Functions

Cosine similarity measures the angle between two vectors:

$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$

See slide 39.
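A one-function numpy sketch of the formula above (the function name is illustrative):

```python
import numpy as np

def cosine_similarity(x, y):
    # Dot product of the vectors divided by the product of their norms.
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))  # ~0.707
```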

Edit Distance

Counts the minimum number of insertions, deletions, and substitutions of characters needed to turn one string into another.

Can be implemented using dynamic programming.
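A standard dynamic-programming sketch of (Levenshtein) edit distance; this is the usual textbook table, not necessarily the slides' exact formulation:

```python
def edit_distance(s, t):
    """d[i][j] = edit distance between s[:i] and t[:j]."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Transforming a prefix to/from the empty string costs its length.
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution (or match)
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # 3
```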

Data Preprocessing

Missing Values

Common approaches for handling them: delete records with missing values; impute a default, mean, or mode value; or predict the missing value from the other attributes.

Finding Outliers

For numeric data: flag values far from the mean, e.g. more than a few standard deviations away.

For categoric data: flag values that occur very rarely; these may also be data-entry errors.

To deal with outliers: remove them, cap them at a threshold, or treat them as missing values. A sketch of the two detection rules above follows.
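A sketch of the two common detection rules, assuming a z-score threshold for numeric data and a frequency threshold for categoric data; the thresholds `k` and `min_count` are illustrative choices, not values from the slides:

```python
import numpy as np
from collections import Counter

def zscore_outliers(values, k=3.0):
    # Numeric rule of thumb: flag points more than k standard deviations from the mean.
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return values[np.abs(z) > k]

def rare_categories(labels, min_count=2):
    # Categoric rule of thumb: flag values that occur very rarely.
    counts = Counter(labels)
    return [v for v, c in counts.items() if c < min_count]

print(zscore_outliers([1, 2, 2, 3, 2, 100], k=2.0))            # [100.]
print(rare_categories(["red", "red", "blue", "red", "gren"]))  # ['blue', 'gren']
```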

Random Sampling

A random sample is often sufficient when reducing a large dataset, but it can miss small sub-populations.

Feature Selection

Includes principal component analysis (PCA) and greedy attribute selection.

See slide 55 onwards in data basics.
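A minimal PCA sketch via the SVD of the centred data matrix, not the slides' derivation; the data here is synthetic:

```python
import numpy as np

def pca(X, n_components):
    # Centre the data, then take the top right singular vectors as principal axes.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T  # projection onto the top components

X = np.random.default_rng(0).normal(size=(100, 5))
print(pca(X, 2).shape)  # (100, 2)
```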


Regression

Regression lets us predict a value for a numeric attribute.

Linear Model With Constant

A linear model with a constant fits $y = a + bx$, choosing $a$ and $b$ to minimise the sum of squared residuals $\sum_i (y_i - a - bx_i)^2$.

For the Coefficient of Determination ($R^2$), see slide 13 in Regression.
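A sketch of fitting such a model by least squares, with the constant as a column of ones in the design matrix, plus the usual $R^2$ computation; the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=50)

# Design matrix with a constant (intercept) column.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Coefficient of determination: 1 - (residual variation / total variation).
y_hat = X @ beta
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print(beta)  # approximately [2, 3]
print(r2)    # close to 1 for this low-noise data
```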

Dealing with Categoric Attributes

Categoric attributes can be included by encoding them as indicator (one-hot) variables, dropping one level so the columns are not linearly dependent. For regularisation, see slide 30.
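A minimal sketch of indicator encoding with pandas (assuming pandas is available); dropping the first level avoids the linear dependence discussed under Correlations below:

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "green", "red"]})

# One indicator column per level; drop the first to avoid linear dependence
# with the intercept column of a regression design matrix.
dummies = pd.get_dummies(df["colour"], drop_first=True)
print(dummies)
```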

Correlations

Two perfectly correlated attributes can't both be used in regression: they would make the design matrix contain linearly dependent columns, violating a requirement of the least-squares solution.

Predicting Categoric Values

See slides 36 onwards. Topics include odds ratios, log odds ratios, the logit transformation, and logistic regression.
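A small sketch of the transformations listed above (odds, log odds / logit, and the logistic function that inverts it); the function names are illustrative:

```python
import numpy as np

def odds(p):
    # Odds of an event with probability p.
    return p / (1 - p)

def logit(p):
    # Log odds: maps probabilities in (0, 1) onto the whole real line.
    return np.log(odds(p))

def sigmoid(z):
    # Inverse of the logit; logistic regression models p = sigmoid(a + b*x).
    return 1 / (1 + np.exp(-z))

p = 0.8
print(odds(p), logit(p), sigmoid(logit(p)))  # 4.0, ~1.386, 0.8
```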


Classification

Classification is about creating models for categoric values.

The central concept is data = model + error.

Classification Process

  1. Training - Obtain a data set with examples and a target attribute.
  2. Evaluation - Apply the model to test data whose class values are hidden; evaluate the quality of the model by comparing its predictions to the hidden values.
  3. Application - Apply model to new data with unknown class value, and monitor accuracy.

Evaluation Approaches

Common approaches include a held-out test set and $k$-fold cross-validation.

Classifier Quality

Measured with statistics such as accuracy, precision, and recall, computed from the confusion matrix.

Class Imbalance

When one class dominates, plain accuracy is misleading: always predicting the majority class already scores highly.

Kappa Statistic

Compares the observed accuracy to the accuracy expected from random guessing:

$\kappa = \frac{p_o - p_e}{1 - p_e}$

where $p_o$ is the observed accuracy and $p_e$ is the accuracy expected by chance agreement. For computing kappa, see slide 13.

Kappa = 1 indicates perfect classification, kappa = 0 is random guessing.
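A sketch of computing kappa from a confusion matrix, assuming rows are true classes and columns are predictions; the counts are made up:

```python
import numpy as np

def kappa(confusion):
    # confusion[i][j] = count of items with true class i predicted as class j.
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    p_o = np.trace(confusion) / total  # observed accuracy
    # Chance agreement from the row (true) and column (predicted) marginals.
    p_e = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total**2
    return (p_o - p_e) / (1 - p_e)

print(kappa([[40, 10],
             [5, 45]]))  # 0.7
```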

Entropy

‘Informativeness’ is measured by the entropy of the class distribution, $H = -\sum_i p_i \log_2 p_i$: highest for a uniform distribution, zero when one class is certain.

Information Gain

General principle: pick the split that does most to separate classes, that is pick split with highest information gain.

The information gain from a proposed split is the entropy before the split minus the weighted sum of the entropies of the resulting partitions.

For more information see slide 24.
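A sketch of entropy and information gain for a candidate split; the labels and partitions are illustrative:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # H = -sum(p * log2(p)) over the class proportions.
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, partitions):
    # Gain = entropy before the split minus the weighted entropy after it.
    n = len(labels)
    after = sum(len(part) / n * entropy(part) for part in partitions)
    return entropy(labels) - after

labels = ["yes"] * 5 + ["no"] * 5
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
print(entropy(labels))                  # 1.0 for a 50/50 class mix
print(information_gain(labels, split))  # ~0.278: the split separates the classes
```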

Decision Tree Algorithms

Different decision tree algorithms use different split rules. See slide 27 onwards.

Naive Bayes and SVM

See slide 35 onwards.


Clustering

Clustering aims to find classes without labelled examples. It is an ‘unsupervised’ learning method.

Distance Measurement

Some clustering algorithms assume a certain distance function.

We use distances that obey the metric rules:

  1. Non-negativity and identity: $d(x, y) \ge 0$, with $d(x, y) = 0$ if and only if $x = y$.
  2. Symmetry: $d(x, y) = d(y, x)$.
  3. Triangle inequality: $d(x, z) \le d(x, y) + d(y, z)$.

Most distance measurements of interest are metrics.

Objective Functions

It is NP-hard to find an optimal clustering. There are two computational approaches:

Cluster Evaluation

Hierarchical Clustering

Hierarchical Agglomerative Clustering (HAC) creates a binary tree structure (dendrogram) of clusters: start with each point in its own cluster, then repeatedly merge the two closest clusters until only one remains.

The distance between clusters can use single-link (closest pair), complete-link (furthest pair), or average-link (mean pairwise distance).

The algorithm is polynomial.
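A usage sketch with scipy's implementation (assuming scipy is available) rather than a from-scratch version; `method` selects the link rule described above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Two well-separated blobs of points.
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])

# method can be "single", "complete", or "average" (the link rules above).
Z = linkage(X, method="average")                 # Z encodes the binary merge tree
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
```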

Density-Based Clustering

For the DBSCAN algorithm, see slide 48 onwards.
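A usage sketch with scikit-learn's DBSCAN (assuming scikit-learn is available) rather than the slides' pseudocode; `eps` and `min_samples` are the density parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(4, 0.3, (20, 2)),
               [[10.0, 10.0]]])  # an isolated point

# eps: neighbourhood radius; min_samples: points needed to form a dense region.
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(labels)  # noise points are labelled -1
```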


Recommendation

Recommender systems produce recommendations tailored to an individual user, typically by predicting the user's preference for unseen items.

Evaluation is similar to evaluating classifiers: for example, hide some known ratings, predict them, and compare the predictions against the held-out values.

Neighbourhood Method

  1. Find the $k$ other users $K$ who are most similar to the target user $u$.
  2. Combine the $k$ users’ weighted preferences.
  3. Use these to make predictions for $u$, as in the sketch below.
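A minimal sketch of the neighbourhood method on a toy ratings matrix; the ratings, the similarity choice (cosine), and $k$ are all illustrative assumptions:

```python
import numpy as np

# Ratings matrix: rows = users, columns = items; 0 marks "not rated".
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def predict(R, u, item, k=2):
    # Cosine similarity of user u to every other user's rating vector.
    sims = R @ R[u] / (np.linalg.norm(R, axis=1) * np.linalg.norm(R[u]))
    sims[u] = -np.inf  # exclude the target user
    # The k most similar users who have rated the item.
    neighbours = [v for v in np.argsort(sims)[::-1] if R[v, item] > 0][:k]
    weights = np.array([sims[v] for v in neighbours])
    # Weighted combination of the neighbours' ratings.
    return np.dot(weights, R[neighbours, item]) / weights.sum()

print(predict(R, u=0, item=2))  # predicted rating of item 2 for user 0
```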

Latent Factor Analysis

Represents users and items as vectors of latent factors; a rating is predicted as the dot product of the user's and the item's factor vectors (matrix factorisation).