Publications in Japanese Linguistics

List of publications in Japanese linguistics (morphosyntax and semantics of contemporary Japanese), in Japanese, French, and English.

Find the daily announcements on the mailing list " linguistique-japonaise -@- " and on @japanese_ling.

See also: the CiNii and Kokken databases.

Title: Shrinking Japanese Morphological Analyzers With Neural Networks and Semi-supervised Learning
Type: Article in proceedings
Authors: Tolmachev, Arseny
Kawahara, Daisuke
Kurohashi, Sadao
Place: Minneapolis, Minnesota
Editor:
Publication name: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)
Publisher: Association for Computational Linguistics
Abstract: For languages without natural word boundaries, like Japanese and Chinese, word segmentation is a prerequisite for downstream analysis. For Japanese, segmentation is often done jointly with part of speech tagging, and this process is usually referred to as morphological analysis. Morphological analyzers are trained on data hand-annotated with segmentation boundaries and part of speech tags. A segmentation dictionary or character n-gram information is also provided as additional input to the model. Incorporating this extra information makes models large. Modern neural morphological analyzers can consume gigabytes of memory. We propose a compact alternative to these cumbersome approaches that does not rely on any externally provided n-gram or word representations. The model uses only unigram character embeddings, encodes them using either a stacked bi-LSTM or a self-attention network, and independently infers both segmentation and part of speech information. The model is trained in an end-to-end and semi-supervised fashion, on labels produced by a state-of-the-art analyzer. We demonstrate that the proposed technique rivals the performance of a previous dictionary-based state-of-the-art approach and can even surpass it when training with the combination of human-annotated and automatically annotated data. Our model itself is significantly smaller than the dictionary-based one: it uses less than 15 megabytes of space.
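
To make the character-level formulation in the abstract concrete, here is a minimal Python sketch of how independent per-character segmentation and part of speech predictions could be merged into tagged words. It is an illustration only, not the authors' implementation: the B/I boundary labels, the tag names, the rule that a word takes the tag of its first character, and the function name `decode` are all assumptions, and the labels are written by hand rather than produced by a model.

```python
from typing import List, Tuple


def decode(chars: str, boundary: List[str], pos: List[str]) -> List[Tuple[str, str]]:
    """Merge per-character predictions into (word, POS) pairs.

    boundary[i] is "B" if character i starts a word and "I" otherwise
    (assumed labelling scheme). pos[i] is the tag predicted for
    character i; a word takes the tag of its first character
    (a simplifying assumption).
    """
    if not (len(chars) == len(boundary) == len(pos)):
        raise ValueError("chars, boundary and pos must have the same length")

    words: List[Tuple[str, str]] = []
    for ch, b, p in zip(chars, boundary, pos):
        if b == "B" or not words:
            # Start a new word at every boundary (and at the first character).
            words.append((ch, p))
        else:
            # Extend the current word, keeping the tag of its first character.
            surface, tag = words[-1]
            words[-1] = (surface + ch, tag)
    return words


# Hand-written labels standing in for model output (hypothetical values).
sentence = "私は学生です"
boundary = ["B", "B", "B", "I", "B", "I"]
pos = ["PRON", "PART", "NOUN", "NOUN", "AUX", "AUX"]

result = decode(sentence, boundary, pos)
for surface, tag in result:
    print(surface, tag)
```

Because each character carries its own labels, no dictionary lookup is needed at decoding time, which is the property the abstract relies on to keep the model small.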

blin -- ehess . fr