AMA_A

Design and Evaluation of Text Pre-Processor: A Tool for Text Pre-Processing

Amit Prasad Rauth | Anjan Pal

Department of Computer Science & Engineering, OmDayal College of Engineering & Architecture, 39(P) & 39(A), Uluberia, West Bengal-711316, India

Corresponding Author Email: amitrauth1234@gmail.com, anjanpal5@gmail.com

Received: 25 September 2017

Accepted: 28 September 2017

OPEN ACCESS

Abstract:

This paper introduces the Text Pre-processor, a tool that integrates several text pre-processing tasks, including tokenization, parts-of-speech tagging, and stop-word elimination. These tasks are prerequisites for text processing applications such as sentiment analysis and text summarization, yet no one-stop solution exists for performing several of them together. The Text Pre-processor aims to fill this gap. The tool comprises five modules: a text editor, single file processing, file-to-file processing, multiple file processing, and split and merge files. Informed by the technology acceptance model, a qualitative user study was conducted to evaluate the efficacy of the tool. Participants generally found the tool efficacious.

Keywords:

Natural language processing, Text processing, Text pre-processing, Text mining tool.
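
To make two of the pre-processing tasks named in the abstract concrete, the following is a minimal sketch of tokenization followed by stop-word elimination. It is not the tool's implementation: the function names `tokenize` and `remove_stop_words`, the regular expression, and the tiny `STOP_WORDS` set are all assumptions chosen for illustration, and parts-of-speech tagging is left out because it would need a trained tagger.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the paper does not specify which list the tool uses.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "for"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that do not appear in the stop-word set, in order."""
    return [token for token in tokens if token not in stop_words]


if __name__ == "__main__":
    sample = "The tool is a one-stop solution for text pre-processing."
    tokens = tokenize(sample)
    print(tokens)
    print(remove_stop_words(tokens))
```

Note that this simple pattern splits hyphenated words such as "pre-processing" into two tokens; a real tokenizer would need an explicit policy for hyphens, punctuation, and contractions.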

1. Introduction

2. Motivation and Significance

3. Features of the Tool

4. Tasks Performed by Text Pre-processor

5. Design and Implementation of Text Pre-processor

6. Evaluation of Text Pre-processor

7. Conclusion

Acknowledgment

The authors would like to thank Mullick Kabirul Huda, Subhendu Maity, Subhajit Das, and Sk. Abdul Nasim for their help in this project.
