We propose a class of complex weight structures called ranked semirings for the compact representation of morphological analysers based on weighted finite-state automata. In an experiment, we compare this compact representation with the conventional representation based on letter transducers.
JSLIM is a software system for writing grammars in accordance with the SLIM theory of language. Written in Java, it is designed to facilitate the coding of grammars for morphology as well as for syntax and semantics. This paper describes the system with a focus on morphology. We show how the system works, how it evolved from previous versions, and how the rules for word form recognition can also be used for word form generation. The first section starts with a basic description of the functionality of a Left Associative Grammar (LAG) and provides an algebraic definition of a JSLIM grammar. The second section deals with the new concepts of JSLIM in comparison with earlier implementations. The third section describes the format of the grammar files, i.e. of the lexicon, of the rules and of the variables. The fourth section addresses the reversibility of grammar rules with the aim of automatic word form production without any additional rule system. We conclude with an outlook on current and future developments.
Morphological analysis of a wide range of languages can be implemented efficiently using finite-state transducer technologies. Over the last 30 years, a number of attempts have been made to create tools for computational morphologies. The two main competing approaches have been parallel vs. cascaded rule application. Parallel rule application was originally introduced by Koskenniemi and implemented in tools like TwolC and LexC. Currently many applications of morphologies could use dictionaries encoding the a priori likelihoods of words and expressions as well as the likelihood of relations to other representations or languages. We have made the choice to create open-source tools and language descriptions in order to let as many people as possible participate in the effort. The current article presents some of the main tools that we have created, such as HFST-LexC, HFST-TwolC and HFST-Compose-Intersect. We evaluate their efficiency in comparison to some similar tools and libraries. In particular, we evaluate them using several full-fledged morphological descriptions. Our tools compare well with similar open-source tools, even if we still have some challenges ahead before we can catch up with the commercial tools. We demonstrate that for various reasons a parallel rule approach still seems to be more efficient than a cascaded rule approach when developing finite-state morphologies.
The present article describes fsm2, a software program which can be used interactively or as a script interpreter to manipulate weighted finite-state automata with around 100 different commands. fsm2 is based on FSM<2.0> – an efficient C++ template library to create and algebraically manipulate weighted automata. fsm2 is particularly well suited to create morphological analysers on the basis of weighted automata.
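As a rough illustration of what it means to weight a finite-state automaton, the following minimal Python sketch computes the weight of an input string over the tropical semiring (weights along a path are added, competing paths are combined by taking the minimum). It does not use fsm2 or the FSM<2.0> library; the toy automaton, the names `TRANSITIONS`, `FINAL` and `weight_of`, and the choice of the tropical semiring are all assumptions made for this example.

```python
import math

# Tropical semiring (assumed for illustration): "plus" is min, "times" is +.
ZERO, ONE = math.inf, 0.0

def t_plus(a: float, b: float) -> float:
    return min(a, b)

def t_times(a: float, b: float) -> float:
    return a + b

# Hypothetical toy automaton: state -> symbol -> list of (target, arc weight).
TRANSITIONS = {
    0: {"a": [(1, 0.5), (2, 1.0)]},
    1: {"b": [(3, 1.5)]},
    2: {"b": [(3, 0.25)]},
}
FINAL = {3: 0.0}  # final state -> final weight

def weight_of(word: str) -> float:
    """Return the combined weight of all accepting paths for `word`."""
    current = {0: ONE}  # state 0 is the start state
    for symbol in word:
        following = {}
        for state, weight in current.items():
            for target, arc_weight in TRANSITIONS.get(state, {}).get(symbol, []):
                candidate = t_times(weight, arc_weight)
                following[target] = t_plus(following.get(target, ZERO), candidate)
        current = following
    total = ZERO
    for state, weight in current.items():
        if state in FINAL:
            total = t_plus(total, t_times(weight, FINAL[state]))
    return total

print(weight_of("ab"))  # 1.25: the cheaper of the two accepting paths
print(weight_of("a"))   # inf: no accepting path
```

In a morphological analyser, each accepting path would correspond to one analysis, so the minimum selects the best-scoring analysis under this weighting.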
This paper presents the current activities surrounding Morphisto, an open source morphological analyzer/generator for German, based on the SMOR-based SFST tools. Morphisto was designed as a user-friendly application within the eHumanities project TextGrid. It comprises a minimal lexicon component that works in tandem with SMOR and is now being extended within the ELEXIKO project. Additional tools for the management of lexical data and services built on top of the finite state transducer are also integrated as Web Services in the grid, so that all resources can be shared readily among lexicographers, linguists, and finite-state developers.
The MPRO system is a package of programs developed by the Institute for Applied Information Science (IAI) to perform morphological, syntactic and semantic analysis for several languages. For German in particular, the MPRO system is a very efficient and robust language software tool. The linguistic analysis within the system is split up into several sub-tasks. In this paper we present the general MPRO tagging procedure and its sub-tasks in some detail. We take a closer look at the individual program modules and their particular output. The function of each module is described in detail, together with the possible parameters of the features given in its output. Furthermore, the description of each module is supplemented with numerous practical examples to demonstrate the functionality of the system. Finally, some possible areas of application of the MPRO software package are stated.
Word Manager (WM) is a system for the specification and use of morphological knowledge. It is based on a problem analysis that departs from the usual distinction between a rule component and a lexicon. Instead, it integrates rules and lexical knowledge for morphology in an expert workbench from which components for the performance of certain operational tasks (e.g., analysis or lemmatization) can be derived. The WM formalism and specification environment have been developed in such a way that the support for individual tasks is optimized. The rule formalism incorporates separate rule types for inflection, word formation, and spelling (i.e., where morphology is not purely concatenative) in the WM core. Phrase Manager includes rules for clitics, periphrastic inflection, and multi-word units. These rules also make it possible to encode German separable verbs adequately. Lexicographic specification is distinct from rule specification and exploits the rules to increase speed and consistency. WM components have been used in various environments.
Integrating the decomposition of unknown morphologically complex words can enhance the recognition rates of morphological analyzers. Using linguistically motivated strategies for this decomposition leads to even more expressive results. The approach described here uses word formation rules and filtering techniques to analyze and decompose words that are not contained in the underlying dictionary database. The average recognition rate of our German analyzers, applied to our test corpus, increased from 91% to 95.4%. Together with the current implementation, future decomposition strategies will be presented.
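The general idea of decomposing an unknown word against a dictionary can be sketched as follows. This is a simplified illustration and not the authors' implementation: the toy `LEXICON`, the `LINKERS` tuple, and the single filter (a linking element may only occur between two stems) are assumptions standing in for the word formation rules and filtering techniques mentioned in the abstract.

```python
from functools import lru_cache

# Hypothetical toy lexicon and linking elements; not taken from the paper.
LEXICON = {"haus", "tür", "schlüssel", "arbeit", "zimmer"}
LINKERS = ("", "s")  # simplified German linking elements (Fugenelemente)

def decompose(word: str) -> list[list[str]]:
    """Return every split of `word` into lexicon stems joined by linkers."""
    word = word.lower()

    @lru_cache(maxsize=None)
    def split(start: int) -> tuple:
        if start == len(word):
            return ((),)  # one way to analyse the empty remainder
        results = []
        for end in range(start + 1, len(word) + 1):
            stem = word[start:end]
            if stem not in LEXICON:
                continue
            for linker in LINKERS:
                after = end + len(linker)
                if word[end:after] != linker:
                    continue
                # Filter: a linker must be followed by another stem.
                if linker and after == len(word):
                    continue
                for rest in split(after):
                    results.append((stem,) + rest)
        return tuple(results)

    return [list(parts) for parts in split(0)]

print(decompose("Arbeitszimmer"))     # [['arbeit', 'zimmer']]
print(decompose("Haustürschlüssel"))  # [['haus', 'tür', 'schlüssel']]
print(decompose("Fahrrad"))           # [] (no stems in the toy lexicon)
```

A real system would additionally rank or filter competing splits, since a larger lexicon tends to produce several candidate decompositions per word.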
Language software applications encounter new words, e.g., acronyms, technical terminology, loan words, names or compounds of such words. To add new words to a morphological lexicon, we need to determine their base form and indicate their inflectional paradigm. A base form and a paradigm define a lexeme. In this article, we evaluate a lexicon-based method augmented with data from a corpus or the internet for generating and ranking lexeme suggestions for new words. As an entry generator often produces numerous suggestions, it is important that the best suggestions be among the first few; otherwise it may become more efficient to create the entries by hand. By generating lexeme suggestions with an entry generator and then further generating some key word forms for the lexemes, we can find support for the lexemes in a corpus. Our ranking methods have 56–79% average precision and 78–89% recall among the top 6 candidates, i.e. an F-score of 65–84%, indicating that the first correct entry suggestion is on average found as the second or third candidate. The corpus-based ranking methods were found to be significant in practice, as they save time for the lexicographer by increasing recall by 7–8% among the top candidates.
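The reported F-score range can be checked against the precision and recall figures with the standard balanced F-measure, F = 2PR / (P + R). The short sketch below assumes that the lower bounds (56% precision, 78% recall) belong together and likewise the upper bounds (79%, 89%); the abstract does not state this pairing explicitly, and the function name `f_score` is chosen for this example only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Assumed pairing of the bounds reported in the abstract.
low = f_score(0.56, 0.78)
high = f_score(0.79, 0.89)
print(f"{low:.1%} to {high:.1%}")  # 65.2% to 83.7%
```

Rounded to whole percentages, this gives 65% and 84%, which agrees with the range stated in the abstract.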