PCA REDUCED FOREST FOR LEARNING TO RANK SPOKEN TRANSCRIPTIONS

Document Type: Original Article

Authors

Computer Engineering Department, Faculty of Engineering, Cairo University

Abstract

This paper addresses the problem of learning to rank, specifically for spoken transcriptions. The state-of-the-art approach for text and web documents is to apply machine learning techniques to learn a ranking model from labeled query-document pairs and their features. One of the best state-of-the-art learning algorithms is the Random Forest. However, it does not perform well when features are dependent or are monotonic transformations of other features, as this makes the trees of the forest less independent. We propose applying Principal Component Analysis (PCA) to bags of features in order to reduce them, which simplifies the model and yields a surrogate score for each field's features, producing a more independent set of features for the Random Forest. Using this technique on a transcriptions dataset, a 4.32% improvement in Expected Reciprocal Rank (ERR@10) and a 0.4% improvement in Normalized Discounted Cumulative Gain (NDCG@10) are achieved on the training data, with very comparable results on the testing data. We further demonstrate the effectiveness of the technique by applying it to a larger benchmark web document dataset, Microsoft LETOR. Improvements of 7.99% and 1.29% on the test data are achieved for the two metrics, respectively.
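The per-field reduction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each field's bag of features is replaced by its projection onto the first principal component only, that the component is found by plain power iteration, and that the field names and column indices are hypothetical. The reduced columns would then serve as inputs to a Random Forest ranker, which is not shown here.

```python
import math
import random


def first_pc_scores(rows, iters=100, seed=0):
    """Project each row onto the first principal component of `rows`.

    Assumption: one component per bag is used as the surrogate score.
    The sign of a principal component is arbitrary.
    """
    n, d = len(rows), len(rows[0])
    # Center every feature column.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration for the dominant eigenvector of the covariance matrix.
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all features constant: nothing to project
            return [0.0] * n
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(d)) for c in centered]


def reduce_by_field(rows, field_columns):
    """Replace each field's bag of feature columns by one surrogate score."""
    per_field = []
    for cols in field_columns.values():
        bag = [[r[j] for j in cols] for r in rows]
        per_field.append(first_pc_scores(bag))
    # Transpose so that each output row corresponds to one query-document pair.
    return [list(t) for t in zip(*per_field)]


# Hypothetical toy data: columns 0-2 belong to a "title" field and are
# linear transformations of each other; columns 3-4 belong to a "body" field.
rows = [
    [1.0, 2.0, 4.0, 0.2, 0.9],
    [2.0, 4.0, 7.0, 0.5, 0.1],
    [3.0, 6.0, 10.0, 0.4, 0.7],
    [5.0, 10.0, 16.0, 0.9, 0.3],
]
field_columns = {"title": [0, 1, 2], "body": [3, 4]}
reduced = reduce_by_field(rows, field_columns)
print(len(reduced), "rows,", len(reduced[0]), "surrogate features per row")
```

In this toy input the three title columns carry the same information, so a single surrogate score loses nothing. In practice, an established PCA implementation from a numerical library would likely be preferable to hand-written power iteration.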