Author identification for Under-Resourced language (KadazanDusun)

This paper presents the task of Author Identification for the KadazanDusun language, using tweets as the data source for Author Identification of short texts in KadazanDusun, which is considered one of the under-resourced languages in Malaysia. The aim of this paper is to demonstrate Auth...

Bibliographic Details
Main Authors:	Tarmizi, Nursyahirah (Author), Saee, Suhaila (Author), Abang Ibrahim, Dayang Hanani (Author)
Format:	EJournal Article
Published:	Institute of Advanced Engineering and Science, 2020-01-01.
Subjects:	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion
Online Access:	Get fulltext


LEADER	02610 am a22003133u 4500
001	ijeecs20593_13226
042			\|a dc
100	1	0	\|a Tarmizi, Nursyahirah \|e author
100	1	0	\|e contributor
700	1	0	\|a Saee, Suhaila \|e author
700	1	0	\|a Abang Ibrahim, Dayang Hanani \|e author
245	0	0	\|a Author identification for Under-Resourced language (KadazanDusun)
260			\|b Institute of Advanced Engineering and Science, \|c 2020-01-01.
500			\|a https://ijeecs.iaescore.com/index.php/IJEECS/article/view/20593
520			\|a This paper presents the task of Author Identification for the KadazanDusun language, using tweets as the data source for Author Identification of short texts in KadazanDusun, which is considered one of the under-resourced languages in Malaysia. The aim of this paper is to demonstrate Author Identification of short texts in KadazanDusun. In addition, this paper examines the performance of two machine learning algorithms on the KadazanDusun data set by analyzing stylometric features. Stylometric features, which include character n-grams and word n-grams, are used to quantify the authors' writing styles. The Author Identification workflow applies a machine learning approach to solve the single-label multi-class problem of predicting the author of a given message in KadazanDusun. Two classifiers, Naïve Bayes and Support Vector Machine (SVM), are compared on accuracy. The results show that the combination of n-grams, namely word-level unigrams and {1-5}-grams with character 3-grams, gives the most relevant stylometric features for identifying the author of a KadazanDusun message, with an accuracy of 80.17%. The results also show that the SVM classifier outperformed Naïve Bayes in this Author Identification task, with an accuracy of 80.17%.
540			\|a Copyright (c) 2019 Institute of Advanced Engineering and Science
540			\|a http://creativecommons.org/licenses/by-nc/4.0
546			\|a eng
690
690			\|a Author identification, KadazanDusun, Machine learning, Stylometry, Under-Resourced language
655	7		\|a info:eu-repo/semantics/article \|2 local
655	7		\|a info:eu-repo/semantics/publishedVersion \|2 local
655	7		\|2 local
786	0		\|n Indonesian Journal of Electrical Engineering and Computer Science; Vol 17, No 1: January 2020; 248-255
786	0		\|n 2502-4760
786	0		\|n 2502-4752
786	0		\|n 10.11591/ijeecs.v17.i1
787	0		\|n https://ijeecs.iaescore.com/index.php/IJEECS/article/view/20593/13226
856	4	1	\|u https://ijeecs.iaescore.com/index.php/IJEECS/article/view/20593/13226 \|z Get fulltext
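
The abstract above names character n-grams and word n-grams as the stylometric features. As a rough illustration only, and not the authors' implementation, the following minimal Python sketch shows one way such features could be counted for a short message. The function names, the `c:`/`w:` key prefixes, and the placeholder input are assumptions made for this example; the placeholder string is not real KadazanDusun text.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text` (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams, using whitespace tokenization."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def stylometric_features(text, char_n=3, word_n=1):
    """Count character and word n-grams in one feature dictionary.

    The "c:" and "w:" prefixes (an assumption of this sketch) keep the
    two feature families from colliding on identical strings.
    """
    features = Counter()
    for gram in char_ngrams(text, char_n):
        features["c:" + gram] += 1
    for gram in word_ngrams(text, word_n):
        features["w:" + gram] += 1
    return features


# Placeholder message, not real KadazanDusun text.
features = stylometric_features("aba aba")
```

Feature counts of this kind would typically be turned into vectors and passed to a classifier such as Naïve Bayes or an SVM, the two classifiers the abstract compares; that step is omitted here.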

Similar Items