Master data

Title: Text preparation through extended tokenization
Subtitle:
Abstract: Tokenization is commonly understood as the first step of any kind of natural language text preparation. The major goal of this early (pre-linguistic) task is to convert a stream of characters into a stream of processing units called tokens. Beyond the text mining community this job is taken for granted. It is commonly seen as an already solved problem that comprises identifying word borders and punctuation marks separated by spaces and line breaks. In our view, however, it should also manage language-related word dependencies, incorporate domain-specific knowledge, and handle morphosyntactically relevant linguistic specificities. Therefore, we propose rule-based extended tokenization that draws on all sorts of linguistic knowledge (e.g., grammar rules, dictionaries). The core features of our implementation are identification and disambiguation of all kinds of linguistic markers, detection and expansion of abbreviations, treatment of special formats, and typing of tokens including single- and multi-tokens. To improve the quality of text mining we suggest linguistically based tokenization as a necessary step preceding further text processing tasks. In this paper we focus on the task of improving the quality of standard tagging.
Keywords:
Publication type: Contribution to an edited volume (authorship)
Publication date: 07.2006 (print)
Published in: Data Mining VII; Data, Text and Web Mining, and their Business Applications
(WIT Press; A. Zanasi, C.A. Brebbia, N.F.F. Ebecken)
Series title: -
Volume number: -
First publication: Yes
Pages: 13 - 21
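
The core features named in the abstract (disambiguating periods, detecting and expanding abbreviations, treating special formats, and typing single- and multi-tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the abbreviation dictionary, the multi-token list, the token type names, and the functions `tokenize` and `merge_multi_tokens` are all hypothetical and chosen only for illustration.

```python
import re
from typing import NamedTuple, Optional


class Token(NamedTuple):
    text: str
    kind: str                        # hypothetical type labels, e.g. WORD, DATE
    expansion: Optional[str] = None  # filled in for known abbreviations only


# Hypothetical linguistic resources; real dictionaries would be far larger.
ABBREVIATIONS = {"e.g.": "for example", "Dr.": "Doctor"}
MULTI_TOKENS = {("New", "York")}

TOKEN_PATTERN = re.compile(r"""
    (?P<DATE>\d{1,2}\.\d{1,2}\.\d{4})   # special format: dates like 07.06.2006
  | (?P<NUMBER>\d+(?:[.,]\d+)?)         # integers and decimals
  | (?P<ABBREV>(?:[A-Za-z]+\.)+)        # abbreviation candidate (letters + periods)
  | (?P<WORD>\w+)                       # ordinary word
  | (?P<PUNCT>[^\w\s])                  # any single punctuation mark
""", re.VERBOSE)


def merge_multi_tokens(tokens):
    """Join adjacent tokens that form a known two-word multi-token."""
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(t.text for t in tokens[i:i + 2])
        if pair in MULTI_TOKENS:
            merged.append(Token(" ".join(pair), "MULTI"))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def tokenize(text):
    """Split text into typed tokens using the rules and dictionaries above."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        kind, value = match.lastgroup, match.group()
        if kind == "ABBREV" and value in ABBREVIATIONS:
            # Known abbreviation: keep the period and attach the expansion.
            tokens.append(Token(value, "ABBREV", ABBREVIATIONS[value]))
        elif kind == "ABBREV":
            # Unknown candidate: the period is ordinary punctuation.
            for part in re.findall(r"\w+|[^\w\s]", value):
                tokens.append(Token(part, "WORD" if part[0].isalnum() else "PUNCT"))
        else:
            tokens.append(Token(value, kind))
    return merge_multi_tokens(tokens)


if __name__ == "__main__":
    for token in tokenize("Dr. Smith moved to New York on 07.06.2006, e.g. for tagging."):
        print(token)
```

In this sketch the period after "Dr." stays part of the abbreviation token, while the period after "tagging" becomes a separate punctuation token; a subsequent tagger would receive these typed tokens instead of whitespace-separated strings.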

Versions

No version available
Publication date: 07.2006
ISBN:
  • 1-84564-178-7
ISSN: 1746-4463
Homepage: -

Authors

Affiliation

Organization Address
Fakultät für Technische Wissenschaften
 
Institut für Informatik-Systeme
Universitätsstr. 65-67
A-9020 Klagenfurt
Austria
  -993503
   kerstin.smounig@aau.at
https://www.aau.at/isys/

Categorization

Subject areas
  • 1148 - Computational linguistics (6633)
  • 1108 - Computer science
  • 1138 - Information systems (5937)
Research cluster: No research cluster selected
Peer reviewed
  • Yes
Publication focus
  • Science to Science (quality indicator: n/a)
Classification scheme of the assigned organizational units:
Working groups
  • Software Engineering Research Group (SERG)

Collaborations

No partner organization selected

Contributions of the publication

No linked publications available