Research | AAU Campus

Publication: Text preparation through extended token...

Master data

English

Title:	Text preparation through extended tokenization
Subtitle:
Abstract:	Tokenization is commonly understood as the first step of any kind of natural language text preparation. The major goal of this early (pre-linguistic) task is to convert a stream of characters into a stream of processing units called tokens. Beyond the text mining community this job is taken for granted. It is commonly seen as an already solved problem that comprises identifying word borders and punctuation marks separated by spaces and line breaks. In our view, however, it should also manage language-related word dependencies, incorporate domain-specific knowledge, and handle morphosyntactically relevant linguistic specificities. We therefore propose rule-based extended tokenization that includes all sorts of linguistic knowledge (e.g., grammar rules, dictionaries). The core features of our implementation are identification and disambiguation of all kinds of linguistic markers, detection and expansion of abbreviations, treatment of special formats, and typing of tokens, including single- and multi-tokens. To improve the quality of text mining, we suggest linguistically based tokenization as a necessary step preceding further text processing tasks. In this paper we focus on the task of improving the quality of standard tagging.
Keywords:

Publication type:	Contribution to an edited volume (authorship)
Publication date:	07.2006 (Print)
Published in:	Data Mining VII; Data, Text and Web Mining, and their Business Applications (WIT Press; A. Zanasi, C.A. Brebbia, N.F.F. Ebecken)
Series title:	-
Volume number:	-
First publication:	Yes
Pages:	pp. 13 - 21
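
The features named in the abstract (abbreviation detection and expansion, treatment of special formats, typing of tokens) can be illustrated with a minimal rule-based sketch. This is not the authors' implementation: the abbreviation dictionary, the token type names, and the date and number patterns below are all assumptions chosen for illustration.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str      # surface form as found in the input
    kind: str      # token type assigned by the matching rule
    expanded: str  # expansion for abbreviations, otherwise the surface form


# Hypothetical mini-dictionary; a real system would load a domain lexicon.
ABBREVIATIONS = {"e.g.": "for example", "Dr.": "Doctor", "approx.": "approximately"}

# Longest abbreviations first, and never starting in the middle of a word.
_ABBR = "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))

# Ordered rules: at a given position, the first rule that matches wins,
# so abbreviations and special formats are tried before plain words.
RULES = [
    ("ABBREVIATION", rf"(?<!\w)(?:{_ABBR})"),
    ("DATE", r"\d{1,2}\.\d{1,2}\.\d{4}"),       # assumed format: DD.MM.YYYY
    ("NUMBER", r"\d+(?:[.,]\d+)?"),
    ("WORD", r"[^\W\d_]+(?:-[^\W\d_]+)*"),      # letters, optionally hyphenated
    ("PUNCT", r"[^\w\s]|_"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in RULES))


def tokenize(text: str) -> List[Token]:
    """Split text into typed tokens and expand known abbreviations."""
    tokens = []
    for match in MASTER.finditer(text):
        surface = match.group()
        tokens.append(Token(surface, match.lastgroup, ABBREVIATIONS.get(surface, surface)))
    return tokens


if __name__ == "__main__":
    for token in tokenize("Dr. Smith paid approx. 3,50 on 07.11.2006, e.g. in cash."):
        print(token)
```

A whitespace-and-punctuation splitter would break "e.g." and "07.11.2006" into several fragments; here the rule order keeps each as a single typed token, which is the kind of input a downstream tagger can use more reliably.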

Versions

No version available

Publication date:	07.2006
ISBN:	1-84564-178-7
ISSN:	1746-4463
Homepage:	-

Authors

(internal)

Assignment

Organisation

Address

Fakultät für Technische Wissenschaften

Universitätsstr. 65-67
AT - A-9020 Klagenfurt

Categorisation

Subject areas	1148 - Computational linguistics (6633) 1108 - Computer science 1138 - Information systems (5937)
Research cluster	No research cluster selected
Peer reviewed	Yes
Publication focus	Science to Science (quality indicator: n/a) Classification scheme of the assigned organisational units: Institut für Informatik-Systeme
Working groups	Software Engineering Research Group (SERG)

Cooperations

No partner organisation selected

Research activities

Projects:	No linked projects
Publications:	No linked publications
Events:	No linked events
Talks:	No linked talks

Contributions to the publication

No linked publications