This project examines two well-defined research tasks. The first is to create a robust Document Classification System that can classify a wide variety of documents, such as scanned newspaper articles, journal pages, receipts, invoices and standard forms, and recognize basic information in them. Extracting relevant information (document features, keywords and metadata) from this corpus is necessary to assist later processing steps such as Optical Character Recognition, Monitoring and Data Analysis Systems. The second research task concerns the Semantic Annotation and Sentiment Analysis of textual information.
Assuming that a document image has been converted to its digitized form and its textual information is available, this project aims to define an innovative Semantic Annotation and Sentiment Analysis framework that blends image analysis and natural language processing techniques to produce a short but comprehensive description of the textual source.