Using reinforcement learning to continuously improve a document treatment chain

Esther Nicart, Bruno Zanuttini, Bruno Grilhères, Patrick Giroux, Arnaud Saval

Cordon Electronics DS2i, 27000 Val de Reuil, France

Normandie Univ, UNICAEN, ENSICAEN, CNRS, GREYC, 14000 Caen, France

Airbus Defence and Space, Élancourt, France

31 December 2017



We model a document treatment chain as a Markov Decision Process, and use reinforcement learning to allow the agent to learn to construct and continuously improve custom-made chains "on the fly". We build a platform which enables us to measure the impact on learning of various models, web services, algorithms, parameters, etc. We apply this in an industrial setting, specifically to an open-source document treatment chain which extracts events from massive volumes of web pages and other open-source documents. Our emphasis is on minimising the burden of the human analysts, from whom the agent learns to improve, guided by their feedback on the events extracted. To this end, we investigate different types of feedback, from numerical feedback, which requires a lot of tuning, to partially and even fully qualitative feedback, which is much more intuitive and demands little to no user calibration. We carry out experiments, first with numerical feedback, then demonstrate that intuitive feedback still allows the agent to learn effectively.


artificial intelligence, reinforcement learning, extraction and knowledge management, man-machine interaction, open source intelligence (OSINT)

1. Introduction
2. The WebLab platform
3. Reinforcement learning
4. Continuous improvement via reinforcement learning
5. Experimental framework
6. Measuring the quality of the results
7. Tests with numerical feedback
8. Tests with intuitive feedback
9. Conclusion and perspectives

The authors wish to thank Hugo Gilbert for the fruitful discussions on qualitative feedback, as well as the anonymous reviewers of IC2015 and of the RIA for their constructive comments.

