# Bag of Tricks for Efficient Text Classification

*WORK IN PROGRESS*

## Contents

## Introduction and Motivation

Neural networks have recently been applied to text classification and have achieved very good performance. However, they are slow at both training and testing time, which limits their usage on very large datasets. The motivation for this paper is to determine whether a simpler text classifier, which is inexpensive in terms of training and test time, can approximate the performance of these more complex neural networks.

The authors suggest that linear classifiers are very effective if the right features are used. The simplicity of linear classifiers allows a model to scale to very large datasets while maintaining good performance. The basis of the analysis in this paper is applying the classifier fastText to two tasks: tag prediction and sentiment analysis, and comparing its performance and efficiency with other text classifiers. The paper claims that this method "can train on a billion words within ten minutes, while achieving performance on par with the state of the art."

## Background

- PLACEHOLDER: we should look at when this

### Natural-Language Processing

- Briefly describe the difference between NLP and text-mining. Maybe comment later about whether fastText accomplishes NLP.

- might want to briefly mention types of models that are used in the experiment for comparison

## Model

### Model Architecture of fastText

- PLACE HOLDER FOR IMAGE FROM ARTICLE*

A linear classifier is limited by its inability to share parameters among features and classes. As a result, classes with very few examples (low frequency) are often misclassified when the output space is large. The model in the paper is built on top of a linear model with a rank constraint and a fast loss approximation.

In the figure, [math] x_1, \ldots, x_N [/math] represent the [math] N [/math] separate N-gram features of the sentence. This feature will be explained in a later section.
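The architecture described above can be sketched in plain Python. This is an illustrative toy with random placeholder weights, not the paper's trained model; the name `fasttext_forward`, the sizes, and the weight values are assumptions. It shows the rank constraint: features are embedded into a low-dimensional space by [math] A [/math], averaged into a text representation, then classified by [math] B [/math].

```python
import math
import random

random.seed(0)
VOCAB, DIM, CLASSES = 100, 10, 5
# Placeholder weights (random, not trained): A embeds words/n-grams into a
# low-dimensional space (the rank constraint), B is the linear classifier.
A = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
B = [[random.gauss(0, 1) for _ in range(CLASSES)] for _ in range(DIM)]

def fasttext_forward(word_ids):
    """Average the feature embeddings, apply the linear classifier, then softmax."""
    hidden = [sum(A[w][d] for w in word_ids) / len(word_ids) for d in range(DIM)]
    scores = [sum(hidden[d] * B[d][c] for d in range(DIM)) for c in range(CLASSES)]
    m = max(scores)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]      # probability distribution over classes

probs = fasttext_forward([3, 17, 42])
```

Averaging the embeddings (rather than concatenating them) is what keeps the hidden representation the same size regardless of document length.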

### Softmax and Hierarchical Softmax

The softmax function *f* is used to compute the probability distribution over the predefined classes. The softmax output layer with negative log-likelihood is given in the article as:

- [math] - \frac{1}{N} \sum_{n=1}^N y_n \log ( f(BAx_n))[/math]

In this formula, [math] A [/math] and [math] B [/math] are weight matrices learned on the training set, [math] x_n [/math] is the normalized bag of features of the [math] n [/math]-th document, and [math] y_n [/math] is the label.

Remark: The negative log-likelihood is a multiclass cross-entropy. What this means is that for a binary problem (dog or not dog), the model outputs two values in [0, 1] that sum to 1 (e.g., Dog = 0.6, Not dog = 0.4). This can be expanded to larger dimensions. In contrast, a sigmoid outputs one value; in the binary case, the other value can be derived as 1 - p.
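The remark above can be verified numerically with a toy sketch (the scores 0.9 and 0.5 are arbitrary, and the helper names are illustrative):

```python
import math

def softmax2(z0, z1):
    """Two-class softmax: returns two probabilities that sum to 1."""
    e0, e1 = math.exp(z0), math.exp(z1)
    return e0 / (e0 + e1), e1 / (e0 + e1)

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

p_dog, p_not_dog = softmax2(0.9, 0.5)
# The two outputs sum to 1, and for two classes softmax reduces exactly to
# a sigmoid of the score difference: p_dog == sigmoid(0.9 - 0.5).
loss = -math.log(p_dog)  # cross-entropy loss if the true label is "dog"
```

This makes the binary equivalence concrete: the second softmax output carries no extra information, since it is always 1 - p of the first.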

Softmax has a complexity of O(kh), where k is the number of classes and h is the dimension of the text representation. The function the authors used for their model was a variation of the softmax function known as **hierarchical softmax**. The hierarchical softmax is based on the Huffman coding tree and reduces the complexity to O(h log2(k)).
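The Huffman construction behind hierarchical softmax can be sketched as follows. This is an illustrative toy, not the authors' implementation, and `huffman_depths` is a hypothetical helper name: each class's depth is the number of binary decisions needed to reach it, so frequent classes sit near the root and the per-prediction cost drops from k to about log2(k).

```python
import heapq

def huffman_depths(freqs):
    """Build a Huffman tree over class frequencies and return each class's
    depth, i.e. the number of binary decisions on its root-to-leaf path."""
    # Each heap entry: (frequency, unique tiebreaker, {class: depth so far}).
    heap = [(f, i, {i: 0}) for i, f in enumerate(freqs)]
    heapq.heapify(heap)
    counter = len(freqs)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every leaf in them one level deeper.
        merged = {k: v + 1 for k, v in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# 8 equally frequent classes: a flat softmax touches all 8 outputs, while
# the Huffman tree needs only log2(8) = 3 decisions per class.
depths = huffman_depths([1] * 8)
```

With skewed frequencies the most common class gets an even shorter path, which is why the expected cost can beat log2(k) in practice.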

### N-gram and Bag of Words

The **Bag of Words model** simplifies text representation by storing a set of words and their frequency counts in a document. **Bag of words is invariant to word order (it is single-word and dictionary-based)**. An example of the model can be found on this Wikipedia page: https://en.wikipedia.org/wiki/Bag-of-words_model.

(1) "John likes to watch movies. Mary likes movies too." => BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1}

(2) "John also likes to watch football games." => BoW2 = {"John":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1}

(3) A document containing the union of (1) and (2) is then represented as BoW3 = {"John":2,"likes":3,"to":2,"watch":2,"movies":2,"Mary":1,"too":1,"also":1,"football":1,"games":1}

The problem with the bag of words model is that it requires an extensive dictionary of words on file, which takes a long time to search through. Additionally, **bag of words loses information because it is single-word based and invariant to order.** Lastly, it will fail if the training set does not cover the entire dictionary of the testing set.
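The bag-of-words construction above can be sketched in a few lines of Python (an illustrative toy; the helper name `bag_of_words` and the simple tokenization are assumptions, not from the paper):

```python
from collections import Counter

def bag_of_words(text):
    """Word frequency counts of a document; all order information is discarded."""
    return Counter(text.lower().replace(".", "").split())

bow1 = bag_of_words("John likes to watch movies. Mary likes movies too.")
# bow1["likes"] == 2 and bow1["movies"] == 2, matching BoW1 above.

# Word-order invariance: two different sentences, one representation.
assert bag_of_words("I love apple") == bag_of_words("apple love I")
```

The final assertion demonstrates the order-invariance problem directly: two sentences with opposite word order collapse to the same counts.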

The **N-gram model (word-based)** simplifies text representation by storing sequences of n adjacent words. Compared to bag of words, any N greater than 1 captures more information. An example of how the N-gram model stores and identifies order is demonstrated in a later section. An example of a document stored as bag of words, unigrams (N = 1), and bigrams (N = 2) can be found below:

- PLACE HOLDER FOR PICTURE*

Source: https://qph.fs.quoracdn.net/main-qimg-c47060d2f02439a44795e2fbcf2ca347-c

In the article, the N-gram model was used instead of bag of words because the authors wanted to capture information about the local word order.
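The difference between unigrams and bigrams can be sketched as follows (the helper `ngrams` is illustrative, not the paper's code):

```python
def ngrams(sentence, n):
    """Word-level n-grams: every run of n adjacent words, preserving local order."""
    words = sentence.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

uni = ngrams("John likes to watch movies", 1)  # unigrams = bag-of-words tokens
bi = ngrams("John likes to watch movies", 2)   # bigrams keep local word order
```

Each bigram ties a word to its neighbor, which is exactly the local-order information that bag of words throws away.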

### Feature Hashing

The authors utilized feature hashing to map N-grams more efficiently (source: https://en.wikipedia.org/wiki/Feature_hashing). Feature hashing is a way to vectorize the n-grams of a document. It is an effective tool when dealing with n-grams in higher-dimensional spaces. The algorithm creates a **hash table**, a data structure consisting of a hash function and a matrix. The hash function maps each n-gram to a position in the matrix. An example is shown below.

A = “I love apple”

B = “apple love I”

C = “I love sentence”

Word | A | B | C
---|---|---|---
I | 1 | 1 | 1
love | 1 | 1 | 1
apple | 1 | 1 | 0
sentence | 0 | 0 | 1

Notice how A and B map to the same vector. This is just like bag of words and the aforementioned problem that **order does not matter!**

Bigram | A | B | C
---|---|---|---
I love | 1 | 0 | 1
love apple | 1 | 0 | 0
apple love | 0 | 1 | 0
love I | 0 | 1 | 0
love sentence | 0 | 0 | 1

Notice that A and B are now distinct, because bigrams take one position of local word order into consideration. However, A and C still share the element "I love". If we were to further increase N, we would have an easier time distinguishing the two. The consequence of operating with higher-order N-grams, however, is that the runtime increases.

The authors utilized the hashing trick from Mikolov et al. (2011), with 10M bins if only bigrams were used, and 100M otherwise. If every word pair received its own bin, 10M bins would correspond to roughly sqrt(10M), i.e. about 3,200 distinct words from the training set.

### Feature Hashing in Detail

Feature hashing, also known as the hashing trick, can be used in sentence classification to map words to indices of an array or a fixed-size matrix by applying a hash function to the features. The general idea is to map sentences from a high-dimensional feature space to a lower-dimensional one, reducing the dimension of the input vector and, therefore, the cost of storing large vectors.

A hash function is any function that maps keys from an arbitrarily large set to a fixed-size set of indices. For example, consider a hash function [math] h [/math] that maps each feature to the index of its corresponding dictionary key:

Key | Index
---|---
I | 0
love | 1
hate | 2
cats | 3
dogs | 4
but | 5
Mary | 6
In this case, [math] h(\text{"cats"}) = 3 [/math]. Consider the sentence [math] \text{"I love cats, but Mary hate cats"} [/math]; we will map it to a hash table of length 7. After tokenizing it, we have the list of all words in the sentence, [math] x = ["I", "love", "cats", "but", "Mary", "hate", "cats"] [/math]. The hashed feature map [math] \phi [/math] is calculated by

[math] \phi_i^{h}(x) = \underset{j:h(x_j)=i}{\sum} 1 [/math], where [math] i [/math] is the corresponding index of the hashed feature map.

By applying the hash function to each word of this sentence, we get the list of returned indices [0, 1, 3, 5, 6, 2, 3], and the corresponding hashed feature map is

0 | 1 | 2 | 3 | 4 | 5 | 6
---|---|---|---|---|---|---
1 | 1 | 1 | 2 | 0 | 1 | 1

There are many choices of hash functions, but the general idea is to have a good hash function that distributes keys evenly across the hash table.
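The unsigned hashed feature map above can be sketched directly. This is a toy: the dictionary-based hash `h` simply mirrors the index table above, and the helper name is an assumption.

```python
def hashed_feature_map(words, h, size):
    """phi_i = number of words w in the document with h(w) = i."""
    phi = [0] * size
    for w in words:
        phi[h(w)] += 1
    return phi

# Hash function taken from the key/index table above.
h = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 6}.get
x = ["I", "love", "cats", "but", "Mary", "hate", "cats"]
phi = hashed_feature_map(x, h, 7)  # "cats" appears twice, so index 3 holds 2
```

A real implementation would use a proper hash function over strings rather than a fixed dictionary, but the resulting vector is computed the same way.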

A hash collision happens when two distinct keys are mapped to the same index. In the example above, suppose both "Mary" and "I" are mapped to index 0. The output hash table then becomes:

0 | 1 | 2 | 3 | 4 | 5 | 6
---|---|---|---|---|---|---
2 | 1 | 1 | 2 | 0 | 1 | 0

In order to get an unbiased estimate, the paper uses a signed hash kernel (Weinberger et al., 2009, https://alex.smola.org/papers/2009/Weinbergeretal09.pdf), which introduces another hash function [math] \xi [/math] that determines the sign of the returned index. The hashed feature map [math] \phi [/math] now becomes

[math] \phi_i^{h, \xi}(x) = \underset{j:h(x_j)=i}{\sum} \xi(x_j) \cdot 1 [/math]

If [math] \xi(\text{"I"}) = 1 [/math] and [math] \xi(\text{"Mary"}) = -1 [/math], our signed hash map becomes:

0 | 1 | 2 | 3 | 4 | 5 | 6
---|---|---|---|---|---|---
0 | 1 | 1 | 2 | 0 | 1 | 0

Ideally, collisions "cancel out", and we therefore achieve an unbiased estimate.
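The signed variant can be sketched the same way (toy code; `signed_feature_map` and the sign function are illustrative assumptions), reproducing the cancellation at index 0:

```python
def signed_feature_map(words, h, xi, size):
    """phi_i = sum of the signs xi(w) over words w with h(w) = i."""
    phi = [0] * size
    for w in words:
        phi[h(w)] += xi(w)
    return phi

# Colliding hash: both "I" and "Mary" map to index 0, as in the example above.
h = {"I": 0, "love": 1, "hate": 2, "cats": 3, "dogs": 4, "but": 5, "Mary": 0}.get
xi = lambda w: -1 if w == "Mary" else 1  # sign function: -1 for "Mary" only
x = ["I", "love", "cats", "but", "Mary", "hate", "cats"]
phi = signed_feature_map(x, h, xi, 7)
# The +1 from "I" and the -1 from "Mary" cancel at index 0.
```

With a random sign function, colliding contributions cancel in expectation, which is what makes the estimate unbiased.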

### TF-IDF

For normal N-grams, word counts are used as features. However, another way to represent the features is called TF-IDF, which is short for **term frequency–inverse document frequency**. It represents the importance of a word to a document.

**Term frequency (TF)** measures the number of times a word occurs in a document. The **inverse document frequency (IDF)** can be considered an adjustment to the term frequency such that a word is not deemed important if it is a generally common word, for example, "the".

TFIDF is calculated as the product of term frequency and inverse document frequency, generally expressed as [math]\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t, D)[/math]

In this paper, TFIDF is calculated in the same way as Zhang et al., 2015, with

- [math] \mathrm{tf}(t,d) = f_{t,d} [/math], where [math] f_{t,d} [/math] is the raw count of [math] t [/math] for document [math] d [/math].
- [math] \mathrm{idf}(t, D) = \log(\frac{N}{| \{d\in D:t\in d \} |}) [/math], where [math] N [/math] is the total number of documents and [math] | \{d\in D:t\in d \} | [/math] is the number of documents that contain the word [math] t [/math].
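The definitions above can be sketched in Python (the toy documents and helper names are illustrative, not from Zhang et al.):

```python
import math

def tf(t, d):
    """Raw count of term t in document d (a list of words)."""
    return d.count(t)

def idf(t, docs):
    """log(N / number of documents containing t)."""
    df = sum(1 for d in docs if t in d)
    return math.log(len(docs) / df)

def tfidf(t, d, docs):
    return tf(t, d) * idf(t, docs)

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "bird", "flew"]]
# "the" appears in every document, so idf("the") = log(3/3) = 0 and its
# tf-idf weight vanishes; "cat" appears in one document and keeps a
# positive weight of log(3) despite having the same raw count.
```

This shows the adjustment described above: a ubiquitous word like "the" is down-weighted to zero even though its raw count matches that of the rarer, more informative word.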

## Experiment

fastText was compared with various other text classifiers on two classification problems:

- Sentiment Analysis
- Tag prediction

## Conclusion

## Commentary and Criticism

## Further Reading

- List of previous paper presentations in chronological order relating to text classification/fastText