TCCT Literature Survey

 — #deeplearning #nlp #transformers

Research Paper Links

[1] https://arxiv.org/pdf/1810.04805.pdf

[2] https://arxiv.org/pdf/1906.08237v2.pdf

[3] https://arxiv.org/pdf/1910.10683.pdf

[4] https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8894084

[5] https://docplayer.net/14928867-Twitter-analytics-for-insider-trading-fraud-detection.html

[6] https://www.sciencedirect.com/science/article/pii/S0893608019302187

[7] https://arxiv.org/pdf/1802.09957.pdf

[8] https://arxiv.org/pdf/1809.07572.pdf

[9] https://arxiv.org/pdf/1907.11692.pdf

[10] https://www.aclweb.org/anthology/W18-4412.pdf

[11] https://www.aclweb.org/anthology/W17-1101.pdf

[12] https://csce.ucmss.com/cr/books/2018/LFS/CSREA2018/ICA4290.pdf

[1]

This paper covers the full BERT pipeline: the model's layers, the pre-training procedure, fine-tuning, and an evaluation against metrics such as the GLUE score, MultiNLI accuracy, and F1 score. The BERT architecture is built on top of Transformer blocks. During preprocessing, the input text representation is formed by combining the corresponding token, segment, and position embeddings. This uniform input representation makes the model easy to fine-tune for many different kinds of NLP tasks. The model is pre-trained on two tasks: Masked Language Modeling and Next Sentence Prediction. Together, these steps produce a strong pre-trained model for language understanding, as the reported metrics show. The paper also includes ablation experiments, covering the effect of the pre-training tasks, the effect of model size, and a feature-based approach, to clarify the relative importance of each component. However, the paper also suggests a limitation: by relying on corrupting the input with masks, BERT neglects dependencies between the masked positions and suffers from a pretrain-finetune discrepancy.
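To make the input-representation step concrete, here is a minimal sketch (my own illustration, not the authors' code) of how the token, segment, and position embeddings are summed; the sizes match BERT(base) but are otherwise illustrative:

```python
# Minimal sketch of BERT's input representation: the sum of token,
# segment, and position embeddings, followed by layer normalization.
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(segments, hidden)
        self.position = nn.Embedding(max_len, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = (self.token(token_ids) + self.segment(segment_ids)
             + self.position(positions))
        return self.norm(x)

# Example: one sequence of 8 tokens, all from segment A.
emb = BertInputEmbeddings()
out = emb(torch.randint(0, 30522, (1, 8)), torch.zeros(1, 8, dtype=torch.long))
print(out.shape)  # torch.Size([1, 8, 768])
```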

[2]

The paper describes the XLNet model and explains how it improves on BERT. The model rests on two fundamental ideas: generalized autoregressive pretraining for language understanding, and Transformer-XL. Autoregressive modeling predicts a word from the context words occurring before or after it. XLNet's major advantage comes from the permutation language modeling technique used during pre-training, which uses permutations of the factorization order to gather information from both the forward and backward directions simultaneously. The paper also presents a fair comparison between BERT and XLNet, reporting the best performance of three variants of each trained with the same data and hyperparameters; the analysis shows that XLNet outperformed BERT. One reason for the gain is Transformer-XL, an enhanced version of the Transformer used in BERT, which adds a segment recurrence mechanism and a relative positional encoding scheme.
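A toy sketch of the permutation idea (an illustration only, not XLNet's actual two-stream attention implementation): sample a random factorization order, then build a mask so that each token may only attend to tokens that come earlier in that order:

```python
# Toy permutation-LM mask: token i may attend to token j only when j
# precedes i in a randomly sampled factorization order.
import torch

def permutation_mask(seq_len: int) -> torch.Tensor:
    order = torch.randperm(seq_len)        # random factorization order
    rank = torch.empty(seq_len, dtype=torch.long)
    rank[order] = torch.arange(seq_len)    # rank[i] = position of i in order
    # mask[i, j] is True when token i is allowed to attend to token j.
    return rank.unsqueeze(1) > rank.unsqueeze(0)

print(permutation_mask(5))
```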

[3]

In this paper, the authors discuss a very interesting approach to transfer learning for NLP: a unified framework that converts all text-based language problems into a text-to-text format. The framework and model are referred to as the "Text-to-Text Transfer Transformer" (T5). The paper also highlights the importance of cleaning the data, and clearly explains how this was done. T5 is trained on unlabeled data and then fine-tuned on labeled text. The baseline model is designed so that the encoder and decoder are each similar in configuration to a BERT(base) stack. For text preprocessing, SentencePiece is used to encode text as WordPiece tokens. Inspired by BERT, the pre-training objective randomly samples and drops out 15% of the tokens in the input sequence. The paper gives clear performance statistics for this model on different benchmarks such as GLUE, SuperGLUE, EnRo, etc.
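The text-to-text framing is easy to illustrate. The task prefixes below follow the style shown in the paper; the example strings themselves are paraphrased:

```python
# Every task becomes "input text -> output text" via a task prefix,
# so one model and one loss handle translation, classification,
# regression (emitted as a string), and summarization alike.
examples = [
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("stsb sentence1: The rhino grazed. sentence2: A rhino is grazing.", "3.8"),
    ("summarize: state authorities dispatched emergency crews tuesday ...",
     "authorities dispatched emergency crews ..."),
]
for source, target in examples:
    print(f"{source!r} -> {target!r}")
```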

[4]

In 2019, Saad and Yang aimed to produce a complete tweet sentiment analysis based on ordinal regression with machine learning algorithms. The suggested model first pre-processes the tweets, then generates an effective feature set with a feature extraction step. Methods such as Support Vector Regression (SVR), Random Forest (RF), and multinomial logistic regression (SoftMax) were employed for the sentiment classification, and Decision Trees (DTs) were also used for the classification and regression tasks. A Twitter dataset was used to evaluate the suggested model, with performance measured by mean absolute error and mean squared error. The test results showed that the suggested model attained the best accuracy, and that the DTs performed well compared with the other methods.
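A hedged sketch of this evaluation setup, using synthetic features in place of the paper's Twitter data and scikit-learn stand-ins for the classifiers:

```python
# Evaluating ordinal sentiment predictions with MAE and MSE.
# The data and models here are illustrative, not the paper's.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))       # stand-in for extracted tweet features
y = rng.integers(0, 5, size=500)     # ordinal sentiment labels 0..4

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "softmax": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    print(name, mean_absolute_error(y_te, pred), mean_squared_error(y_te, pred))
```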

[5]

Gann et al. selected 6,799 tokens, out of the 15,000 tokens occurring 50 times or more in their overall Twitter dataset, and assigned each token a sentiment score, the TSI (Total Sentiment Index), marking it as a positive or a negative token. SVM and Decision Tree models were trained on different training data. They applied Granger Causality Analysis (GCA) and the Durbin-Watson test: daily tweets were processed by the trained SVM, but instead of using a daily count of positive and negative tweets as the metric, a Sentiment Key Performance Index (SKPI) and a stock-market-value time series were used as indicators of sentiment. GCA rests on the assumption that if a variable X causes Y, then changes in X will systematically occur before changes in Y, and the lagged values of X will show a statistically significant correlation with Y. A Daily Sentiment Index (DSI) was also created from the daily positive and negative sentiment counts returned by the model; it behaves like a time derivative, spiking up and down when sentiment changes. Combined, these methods gave the best prediction results on tweets.
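The Granger test itself is straightforward to run; the sketch below uses synthetic series and statsmodels, not the authors' data or code:

```python
# Does a daily sentiment index Granger-cause a stock-value series?
# Synthetic example: "stock" is built to depend on lagged sentiment.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(1)
sentiment = rng.normal(size=200)
stock = 0.5 * np.roll(sentiment, 1) + rng.normal(scale=0.5, size=200)

# Column order matters: the test asks whether the second column
# Granger-causes the first.
data = np.column_stack([stock, sentiment])
grangercausalitytests(data, maxlag=3)
```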

[6]

In 2019, Park et al. developed a semi-supervised, sentiment-discriminative objective to address the problem of documents with only partial sentiment labels. The suggested model not only reflected the partial data but also preserved the local structures obtained from the real data, and it performed well when evaluated on real-world datasets. Also in 2019, Vashishtha and Susan computed the sentiment of social media posts using a new set of fuzzy rules built over multiple datasets and lexicons. Their model combined Word Sense Disambiguation and NLP techniques with a new unsupervised fuzzy rule-based approach for categorizing comments into negative, neutral, and positive sentiment classes. The experiments were performed on three sentiment lexicons, four existing models, and nine freely available Twitter datasets, and the outcomes showed that the introduced method attained the best results.
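As a toy illustration of fuzzy rule-based sentiment classification (my own simplification, not the authors' rule base), triangular membership functions can map a lexicon polarity score in [-1, 1] to the three classes:

```python
# Toy fuzzy classifier: pick the class with the highest membership.
def triangular(x, a, b, c):
    """Membership rising from a to a peak at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def classify(score):
    memberships = {
        "negative": triangular(score, -1.5, -1.0, 0.0),
        "neutral": triangular(score, -0.5, 0.0, 0.5),
        "positive": triangular(score, 0.0, 1.0, 1.5),
    }
    return max(memberships, key=memberships.get)

print(classify(-0.8), classify(0.1), classify(0.7))
# negative neutral positive
```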

[7]

CNNs have been widely applied to image classification problems because of their capability to exploit the 'local stationarity' property of image data: an image pixel presents dependency on its neighboring pixels. The same holds for word embeddings; a word in a sentence depends on the neighboring words of the same sentence. This dependency is exploited by training a CNN on the word embeddings and tuning it to perform classification tasks. The paper by Spiros et al. concludes that the convolutional network performs better than well-established traditional methods, including SVM, KNN, NB, and LDA.
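A minimal sketch of such a text CNN (illustrative sizes, not the paper's exact architecture):

```python
# 1-D CNN over word embeddings: each filter spans k neighboring words,
# exploiting the local dependency between words described above.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=100, n_filters=64,
                 kernel_sizes=(3, 4, 5), n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, token_ids):
        x = self.emb(token_ids).transpose(1, 2)   # (batch, emb, seq)
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

logits = TextCNN()(torch.randint(0, 20000, (4, 50)))
print(logits.shape)  # torch.Size([4, 2])
```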

[8]

This paper focuses primarily on the challenges faced in toxic comment classification. Among the challenges discussed are out-of-vocabulary and misspelled words, long-range dependencies, and multi-word toxic phrases, all of which introduce significant difficulty when training models that aim to identify and classify toxic sentences. Further challenges include doubtful labels, toxicity expressed without swear words, rhetoric and metaphor, and sarcasm and irony.

[9]

RoBERTa is a pre-training approach developed to overcome the shortcomings of BERT. The differences: the model was trained on more data, for longer, and with a bigger batch size; the next-sentence-prediction objective was removed; and the masking pattern applied to the training data was changed dynamically. BERT was optimized with Adam using the following parameters: β1 = 0.9, β2 = 0.999, ε = 1e-6, and L2 weight decay of 0.01. The learning rate was warmed up over the first 10,000 steps to a peak value of 1e-4 and then linearly decayed. BERT was trained with a dropout of 0.1 on all layers and attention weights, and a GELU (Gaussian Error Linear Unit) activation function. Models were pre-trained for S = 1,000,000 updates, with mini-batches containing B = 256 sequences of maximum length T = 512 tokens. The results showed up to a 10% jump in accuracy on the GLUE, SQuAD, and RACE leaderboards.
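The optimization recipe above translates almost directly into PyTorch; in this sketch AdamW's decoupled weight decay stands in for the paper's L2 term:

```python
# Adam with linear warmup to 1e-4 over 10,000 steps, then linear decay
# to zero at 1,000,000 updates, matching the hyperparameters quoted above.
import torch

model = torch.nn.Linear(768, 768)  # stand-in for the full transformer
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), eps=1e-6,
    weight_decay=0.01)

warmup, total = 10_000, 1_000_000
def lr_lambda(step):
    if step < warmup:
        return step / warmup
    return max(0.0, (total - step) / (total - warmup))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# Each training step: optimizer.step() followed by scheduler.step().
```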

[10]

FastText, developed by the Facebook AI Research (FAIR) team, is a text classification tool suited to modeling text involving out-of-vocabulary (OOV) words. Zhang et al. showed that a character-level CNN works well for text classification without the need for word-level input. The study used four classification algorithms: logistic regression, Naive Bayes with SVM, Extreme Gradient Boosting, and the fastText algorithm with a bidirectional LSTM. The bidirectional LSTM is a further improvement on the LSTM in which the network reads the context in both directions and can be trained using all available input information in the past and future of a specific time step. The BiLSTM model was trained on fastText skip-gram embeddings obtained using Facebook's fastText algorithm.
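A minimal BiLSTM classifier of the kind described, with a randomly initialized embedding standing in for the pre-trained fastText skip-gram vectors:

```python
# Bidirectional LSTM over (stand-in) fastText embeddings; the forward
# and backward hidden states are concatenated before classification.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=300, hidden=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)  # load fastText here
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)    # forward + backward

    def forward(self, token_ids):
        out, _ = self.lstm(self.emb(token_ids))
        # Max pooling over time is one common readout; mean pooling or
        # the final states are alternatives.
        return self.fc(out.max(dim=1).values)

print(BiLSTMClassifier()(torch.randint(0, 20000, (4, 30))).shape)
```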

[11]

Word embeddings with a CNN are compared against a bag-of-words (BoW) approach for text classification, using Support Vector Machines (SVM), Naive Bayes (NB), k-Nearest Neighbor (kNN), and Linear Discriminant Analysis (LDA) applied to the constructed DTMs. The methods are evaluated for toxic comment detection on a dataset with six types of toxicity: 'toxic', 'severe toxic', 'obscene', 'threat', 'insult', and 'identity hate'; all of these categories were treated as toxic in order to convert the task into binary classification. Finally, a statistical analysis was performed on the outcomes of the binary classification. For this purpose they counted: samples labeled 'toxic' and predicted 'toxic' as True Positives (TP), samples labeled 'toxic' and predicted 'non-toxic' as False Negatives (FN), samples labeled 'non-toxic' and predicted 'non-toxic' as True Negatives (TN), and samples labeled 'non-toxic' and predicted 'toxic' as False Positives (FP).
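This bookkeeping is exactly what a binary confusion matrix captures; a small sketch with toy labels (1 = 'toxic', 0 = 'non-toxic'):

```python
# Recover TP / FN / TN / FP from predictions, then derive precision/recall.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FN={fn} TN={tn} FP={fp}")
print(f"precision={tp / (tp + fp):.2f} recall={tp / (tp + fn):.2f}")
```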

[12]

The paper questions whether continuing to build state-of-the-art models is really worth it, given the many difficulties in the way. One concern is that the topic is new and no dedicated models have been developed to serve the purpose. The most intimidating challenge with online comment data is that the words are non-standard English, full of typos and spurious characters, which can severely hurt performance on the classification task. Among the models used for training were NBSVM, BiLSTM, and XGBoost. Precision was found to be highest and recall lowest in the XGBoost model, suggesting an inadequacy of negative examples in the dataset.
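NBSVM, one of the models named here, is commonly implemented (following Wang & Manning, 2012) by scaling bag-of-words features with Naive Bayes log-count ratios and training a linear classifier on the result; a toy sketch with invented comments, not the paper's data:

```python
# NB-SVM style features: scale binary bag-of-words by log-count ratios.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["you are awful", "great point, thanks", "awful take", "nice comment"]
labels = np.array([1, 0, 1, 0])   # 1 = toxic, 0 = non-toxic (toy labels)

X = CountVectorizer(binary=True).fit_transform(texts).toarray()
p = X[labels == 1].sum(0) + 1     # smoothed counts in toxic comments
q = X[labels == 0].sum(0) + 1     # smoothed counts in non-toxic comments
r = np.log((p / p.sum()) / (q / q.sum()))  # log-count ratio per token

clf = LogisticRegression().fit(X * r, labels)  # LR stands in for the SVM
print(clf.predict(X * r))
```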