Statistical models for the analysis of short user-generated documents

Inches, Giacomo

Doctoral thesis

Statistical models for the analysis of short user-generated documents : author identification for conversational documents

Inches, Giacomo
Crestani, Fabio (Degree supervisor)

29.01.2014

188 p

Doctoral thesis: Università della Svizzera italiana, 2014

In recent years, short user-generated documents have been gaining popularity on the Internet and attention in the research community. Documents of this kind are generated by users of various online services: platforms for instant messaging, real-time status posting, discussion, and review writing. Each of these services allows users to produce written texts with particular properties, which may require specific algorithms to analyse. In this dissertation we present our work on analysing such documents. We conducted qualitative and quantitative studies to identify the properties that characterise them. We compared these properties with those of the standard documents used in the literature, such as newspaper articles, and defined a set of characteristics that are distinctive of documents generated online. We also observed two classes within online user-generated documents: conversational documents and those involving group discussions. We then focused on the class of conversational documents, which are short and spontaneous. We created a novel collection of real conversational documents retrieved online (e.g. from Internet Relay Chat) and distributed it as part of an international competition (PAN @ CLEF'12). The competition was about author characterisation, one of the authorship attribution tasks documented in the literature. Another such task is authorship identification, which became our main research topic. We approached the authorship identification problem in its closed-class variant. For each problem we employed documents from the collection we released and from a collection of Twitter messages, as representative of conversational or short user-generated documents.
We demonstrated that standard authorship identification techniques are unsuitable for conversational documents and proposed novel methods that reach better accuracy rates. Unlike standard methods, which worked well only for a few authors, the proposed technique achieved significant results even for hundreds of users.

Language

English

Classification

Computer science and technology

License

License undefined

Identifiers

RERO DOC 255252
URN urn:nbn:ch:rero-006-114051
ARK ark:/12658/srd1318463

Persistent URL

https://n2t.net/ark:/12658/srd1318463

Statistics

Document views: 459

File downloads (full text): 343

Short user-generated documents

Instant messaging

Chat

Forum

Blog

Online user-generated documents

Conversational documents

Authorship attribution

Authorship identification

Twitter

MRR

Information retrieval

Text mining

Burstiness

Interlocutors

Statistics