Identification of product categories from advertising catalogs

Céline Alec, Chantal Reynaud-Delaître, Brigitte Safar, Zied Sellami, Uriel Berdugo

LRI, Univ. Paris-Sud, CNRS, Université Paris-Saclay, Orsay, France

Linagora, 100 Terrasse Boieldieu - Tour Franklin, Paris - La Défense, France

Wepingo, 6 Cour Saint Eloi, Paris, France

Corresponding Author Email: celine.alec@lri.fr, chantal.reynaud@lri.fr, brigitte.safar@lri.fr, zsellami@linagora.com, uriel.berdugo@wepingo.com

Page: 557-578

DOI: https://doi.org/10.3166/RIA.30.557-578

OPEN ACCESS

Abstract: 

In this paper, we propose an ontology-based information extraction approach applied to documents from advertising catalogs. These documents are relatively poor descriptions of products. The information to be extracted, that is, the annotations, concerns the categories and features of the products, which are listed in a domain ontology. Extracting information about a product is therefore an ontology population process: more precisely, it populates the concepts representing the product's categories and features. The poverty of the descriptions makes fully automatic population impossible. We propose a two-step approach: (1) a semi-automatic annotation step, which covers a small set of documents; (2) a second step, which annotates all other documents entirely automatically, using machine learning mechanisms that exploit the results of the first step. The originality of this work lies in an incremental approach that refines the extracted information. The work described has been applied to real data in the domain of toys.
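The two-step idea in the abstract can be illustrated with a minimal sketch. It assumes that step 1 has already produced a small seed set of (description, category) pairs, and it uses a nearest-centroid classifier over bag-of-words vectors as a stand-in for the learning mechanism, which the abstract does not specify. The function names, the category labels `Puzzle` and `SoftToy`, and the sample descriptions are all hypothetical, and each document receives a single category here, whereas the approach described covers several categories and features per product.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a description and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def vectorize(text):
    """Build a bag-of-words term-frequency vector for one description."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train_centroids(seed):
    """Sum the term vectors of the seed documents, one vector per category."""
    centroids = defaultdict(Counter)
    for text, category in seed:
        centroids[category].update(vectorize(text))
    return dict(centroids)


def annotate(text, centroids):
    """Assign the category whose centroid is most similar to the description."""
    vec = vectorize(text)
    return max(sorted(centroids), key=lambda c: cosine(vec, centroids[c]))


# Step 1 (assumed done semi-automatically): a small annotated seed set.
seed = [
    ("wooden puzzle with 24 pieces", "Puzzle"),
    ("giant floor puzzle of animals", "Puzzle"),
    ("plush teddy bear, soft and washable", "SoftToy"),
    ("soft plush rabbit for babies", "SoftToy"),
]

# Step 2: learn from the seed set, then annotate the remaining documents.
centroids = train_centroids(seed)
remaining = ["animal puzzle 50 pieces", "washable plush dog"]
annotations = {doc: annotate(doc, centroids) for doc in remaining}
```

In this sketch the seed set plays the role of the first step's output; a real system would replace the centroid comparison with a trained classifier and restrict the candidate labels to concepts of the domain ontology.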

Keywords: 

information extraction, ontology population, semantic annotation, B2C application.

1. Introduction
2. Working framework
3. State of the art
4. Proposed ontology population approach
5. Evaluation of the approach
6. Conclusion and perspectives
Acknowledgments

We thank the company Wepingo, which funded this work as part of the PORASO project.
