Design and Evaluation of Text Pre-Processor: A Tool for Text Pre-Processing

Amit Prasad Rauth, Anjan Pal

Department of Computer Science & Engineering, OmDayal College of Engineering & Architecture, 39(P) & 39(A), Uluberia, West Bengal-711316, India

Corresponding Author Email: amitrauth1234@gmail.com, anjanpal5@gmail.com
Page: 169-183
DOI: https://doi.org/10.18280/ama_a.540203
Received: 25 September 2017
Accepted: 28 September 2017

OPEN ACCESS

Abstract: 

This paper introduces the Text Pre-processor, a tool that integrates several text pre-processing tasks such as tokenization, part-of-speech tagging, and stop-word elimination. These tasks are prerequisites for text processing applications such as sentiment analysis and text summarization. However, no one-stop solution exists for performing multiple text pre-processing tasks, and the Text Pre-processor aims to fill this gap. The tool comprises five modules: a text editor, single-file processing, file-to-file processing, multiple-file processing, and splitting and merging of files. Informed by the technology acceptance model, a qualitative user study was conducted to evaluate the efficacy of the tool. Participants generally found the tool efficacious.
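
To make two of the pre-processing tasks named above concrete, the following is a minimal Python sketch of tokenization followed by stop-word elimination. It is an illustration only and not the implementation of the Text Pre-processor: the function names, the regular expression, and the very small stop-word list are all assumptions chosen for brevity. Part-of-speech tagging is omitted because it normally relies on a trained tagger from an external library.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration;
# it is not the list used by the Text Pre-processor.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and", "to", "in"}


def tokenize(text):
    """Split text into lowercase tokens made of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens found in the stop-word set, preserving the original order."""
    return [token for token in tokens if token not in stop_words]


def preprocess(text):
    """Tokenize the text, then eliminate stop words."""
    return remove_stop_words(tokenize(text))


if __name__ == "__main__":
    print(preprocess("The tool is a one-stop solution."))
```

Note that this simple pattern splits hyphenated words such as "one-stop" into two tokens; a real tokenizer would typically make that behaviour configurable.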

Keywords: 

Natural language processing, Text processing, Text pre-processing, Text mining tool.

1. Introduction
2. Motivation and Significance
3. Features of the Tool
4. Tasks Performed by Text Pre-processor
5. Design and Implementation of Text Pre-processor
6. Evaluation of Text Pre-processor
7. Conclusion
Acknowledgment

The authors would like to thank Mullick Kabirul Huda, Subhendu Maity, Subhajit Das, and Sk. Abdul Nasim for their help in this project.
