Thesis report and implementation for feature extraction based on the geographical origin of the author


Design and Implementation of a Methodology for the Automatic Identification of the User Geographic Idiom in a Social Media Text Corpus


The rapid spread of social media raises more and more questions for the scientific community to investigate and study. The sheer volume of information is in itself a management challenge. Organizing this information by topic, author, age, gender, or geographical origin is one example of the problems that seek a solution.

The purpose of this project is to develop a methodology for automatically recognizing the regional idiom of an author from a social media text corpus. We first review the fields of text classification, knowledge extraction from text, and author recognition. We then collect data from social media, specifically from users whose origin is known. Once the text is collected, it is preprocessed and annotated so that features can be extracted. Feature extraction is based on linguistic elements and also on idioms that betray the geographical origin of the author. Finally, we perform classification experiments using several classification algorithms, comparing and evaluating the results we obtain.
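To make the feature extraction step more concrete, here is a minimal sketch of how idiom-based and simple linguistic features could be computed for a single text. It is not the code under /dev: the `REGION_MARKERS` lexicon, the region labels, the marker words, and the function names are all hypothetical placeholders, and the three linguistic features are only assumed examples of what "linguistic elements" might cover.

```python
import re
from collections import Counter

# Hypothetical lexicon: region label -> marker words.
# These are placeholders for illustration, not real dialect data.
REGION_MARKERS = {
    "region_a": {"foo", "bar"},
    "region_b": {"baz", "qux"},
}

# \w matches Unicode word characters for str patterns, so Greek text works too.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def extract_features(text):
    """Return a dict of simple linguistic and idiom-based features."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    n = len(tokens)
    features = {
        "n_tokens": n,
        # Guard against empty input to avoid division by zero.
        "avg_token_len": sum(len(t) for t in tokens) / n if n else 0.0,
        "type_token_ratio": len(counts) / n if n else 0.0,
    }
    # One feature per region: how many tokens are markers of that region.
    for region, markers in REGION_MARKERS.items():
        features[f"markers_{region}"] = sum(counts[m] for m in markers)
    return features


if __name__ == "__main__":
    print(extract_features("Foo went to the bar, then foo again"))
```

Each text then becomes a fixed-length feature vector that any standard classifier can consume, which is what makes it possible to compare several classification algorithms on the same data.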

Used Tools/Technologies

Keep in mind

The text corpus is not included in this repository. Full details are available in thesis-report.pdf (in Greek), and the implementation (code for feature extraction) is available under the /dev subdirectory.


Simakis Panagiotis



Version 3, 29 June 2007

Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.