HeLI-OTS 2.0 – META-SHARE

Last view: 2024-04-26

Last update: 2024-04-05

HeLI-OTS 2.0

heli-ots-v2

Persistent Identifier of this resource:

http://urn.fi/urn:nbn:fi:lb-2024040301

Access location: http://urn.fi/urn:nbn:fi:lb-2024040302

HeLI off-the-shelf language identifier with language models for 220 languages.

# Performance

It can identify about 600 to 1700 sentences (averaging about 150 characters each) per second from a file, using one core and around 4.3 gigabytes of memory on a modern laptop.
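
As a rough illustration of what these figures mean for a larger job, the following sketch turns the quoted throughput range into a processing-time estimate. The class and method names are made up for this example and are not part of HeLI-OTS, and actual throughput will depend on the hardware and the text.

```java
// Rough processing-time estimate based on the throughput range quoted above.
// ThroughputEstimate is an illustrative helper, not part of HeLI-OTS.
class ThroughputEstimate {
    // Quoted range: roughly 600 to 1700 sentences per second on one core.
    static final double SLOW_RATE = 600.0;
    static final double FAST_RATE = 1700.0;

    /** Seconds needed to classify the given number of sentences at the given rate. */
    static double secondsFor(long sentences, double sentencesPerSecond) {
        if (sentences < 0 || sentencesPerSecond <= 0) {
            throw new IllegalArgumentException("sentences must be >= 0 and rate must be > 0");
        }
        return sentences / sentencesPerSecond;
    }

    public static void main(String[] args) {
        long sentences = 1_000_000L;
        System.out.printf("fast end of range: %.0f s%n", secondsFor(sentences, FAST_RATE));
        System.out.printf("slow end of range: %.0f s%n", secondsFor(sentences, SLOW_RATE));
    }
}
```

For one million sentences this works out to roughly 10 to 28 minutes of identification time, not counting the time needed to load the language models.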

# Requirements

Java
The software has been created and tested on macOS and Windows 11.

# Setting up HeLI-OTS

The GitHub repository does not include a pre-compiled version of HeLI.jar. The .jar file can be downloaded from:
https://zenodo.org/doi/10.5281/zenodo.4780897

Note that you need the Java Development Kit (JDK) in order to create .jar files. The Java Runtime Environment (JRE) does NOT include the jar program.

The HeLI.jar can be created from the command line within the src folder using:
```
jar cmf HeLI.mf HeLI.jar HeLI.class HeLI.java languagelist LanguageModels confidenceThresholds
```

# Command line use

To use the language identifier, you only need to download the HeLI.jar file, which is used as follows.

These examples are for the jar file. To run the program directly from the GitHub version instead, leave out ```-jar``` and ```.jar```.

Please note that loading the language models takes the same amount of time (up to one minute) regardless of the size of the text file.

Usage:
```
java -jar HeLI.jar -r <infile> -w <outfile>
```

The program reads <infile>, classifies the language of each line as one of the 220 languages it knows,
and writes the results, one ISO 639-3 code per line, into the file <outfile>.

You can use the -c option to make the program print a confidence score for the identification after each language code.
The lower the 'confidence score' the more sure the identification is.

Usage:
```
java -jar HeLI.jar -c -r <infile> -w <outfile>
```
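
When the results are post-processed in Java, each output line has to be split into the language code and the score. The sketch below shows one way to do that. It assumes that the code and the score are separated by whitespace (for example `fin 3.21`), which is not stated in this description, so check a real output file before relying on it. The class `HeliResult` is hypothetical and not part of HeLI-OTS.

```java
// Minimal sketch for reading one line of HeLI-OTS output produced with -c.
// ASSUMPTION: the language code and the score are separated by whitespace,
// e.g. "fin 3.21". HeliResult is an illustrative class, not part of HeLI-OTS.
class HeliResult {
    final String languageCode; // ISO 639-3 code as printed by the program
    final double confidence;   // NaN when the line carries no score

    HeliResult(String languageCode, double confidence) {
        this.languageCode = languageCode;
        this.confidence = confidence;
    }

    /** Parses a line such as "fin 3.21" or, without -c, just "fin". */
    static HeliResult parse(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields[0].isEmpty()) {
            throw new IllegalArgumentException("empty result line");
        }
        double score = fields.length > 1 ? Double.parseDouble(fields[1]) : Double.NaN;
        return new HeliResult(fields[0], score);
    }

    public static void main(String[] args) {
        HeliResult result = parse("fin 3.21");
        System.out.println(result.languageCode + " -> " + result.confidence);
    }
}
```

Lines without a score are accepted as well, so the same parser can read output written without the -c option.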

To restrict identification to a set of relevant languages, give a comma-separated list of their ISO 639-3 identifiers after the -l option.

Usage:
```
java -jar HeLI.jar -r <infile> -w <outfile> -l fin,swe,eng
```

You can give the number of top-scored languages to print after the -t option.

Usage:
```
java -jar HeLI.jar -r <infile> -w <outfile> -l fin,swe,eng -t 2
```
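
If the identifier is launched from another Java program, the options described so far can be assembled into an argument list. The following is a minimal sketch of that idea. Only the flags -r, -w, -c, -l and -t come from this description; the class `HeliCommand`, its parameters and the file names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: assemble a "java -jar HeLI.jar ..." command line from the options
// described above. HeliCommand is an illustrative helper, not part of HeLI-OTS.
class HeliCommand {
    static List<String> build(String jarPath, String inFile, String outFile,
                              boolean printConfidence, List<String> languages, int topN) {
        List<String> cmd = new ArrayList<>(List.of("java", "-jar", jarPath));
        if (printConfidence) {
            cmd.add("-c"); // print a confidence score after each language code
        }
        cmd.add("-r");
        cmd.add(inFile);
        cmd.add("-w");
        cmd.add(outFile);
        if (languages != null && !languages.isEmpty()) {
            cmd.add("-l"); // comma-separated ISO 639-3 identifiers
            cmd.add(String.join(",", languages));
        }
        if (topN > 0) {
            cmd.add("-t"); // number of top-scored languages to print
            cmd.add(Integer.toString(topN));
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("HeLI.jar", "input.txt", "output.txt",
                false, List.of("fin", "swe", "eng"), 2);
        // prints: java -jar HeLI.jar -r input.txt -w output.txt -l fin,swe,eng -t 2
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list can be handed to `java.lang.ProcessBuilder`, which accepts a `List<String>` as the command to run.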

You can activate language set identification with the -s option. If a row contains longer passages in multiple languages, all the languages detected in the row are returned. You must give the maximum number of resulting languages after the -s option. Language set identification overrides both the confidence score output and the printing of several top-scored languages.

Usage:
```
java -jar HeLI.jar -r <infile> -w <outfile> -l fin,swe,eng -s 2
```

If you omit both of the filenames, the program will read the standard input one line at a time and write the result to standard output.

# Citations

If you use this program in producing scientific publications, please cite:
```
@inproceedings{heliots2022,
title = "{H}e{LI-OTS}, Off-the-shelf Language Identifier for Text",
author = "Jauhiainen, Tommi and
Jauhiainen, Heidi and
Lind{\'e}n, Krister",
booktitle = "Proceedings of the 13th Conference on Language Resources and Evaluation",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.416.pdf",
pages = "3912--3922",
language = "English",
}
```
HeLI-OTS-2.0.zip includes the complete source code for the software.

# Acknowledgements

Producing and publishing this software has been partly supported by the Finnish Research Impact Foundation's Tandem Industry Academia funding in cooperation with Lingsoft, by the Kone Foundation, by the European Union – NextGenerationEU instrument, and by the Research Council of Finland under grant number 358720 (FIN-CLARIAH – Developing a Common RI for CLARIAH Finland).

Distribution

Availability: Available - Unrestricted Use

Licence

Apache Licence 2.0

Download location: hidden

Distribution Access/Medium: Downloadable

Licensors:

University of Helsinki

Distribution rights holders:

University of Helsinki

IPR Holder

Heidi Jauhiainen

Tommi Jauhiainen

University of Helsinki

Contact Person

User support FIN-CLARIN

toolService

Tool Language identifier

Language Independent

Input

Media type: Text

Modality: Written Language

Character encoding: UTF-8

Output

Media type: Text

Resource type: Language Description

Modality: Written Language

Character encoding: UTF-8

Tagset: ISO 639-3

Operation

Operating system: OS-independent

Running environment details: Java

Evaluation

Evaluated: False

Creation

Programming language: Java

Resource Creation

Resource Creator

Tommi Jauhiainen

Santtu Valosaari

Heidi Jauhiainen

Metadata

Created: 06/02/2021

Last Updated: 04/03/2024

Metadata Language: English (en)

Metadata Creator

Version

Version: 2

Usage

Actual Use - Nlp Applications

Use NLP Specific: Language Identification

Details: Usage: java -jar HeLI.jar -r <infile> -w <outfile> The program reads <infile>, classifies the language of each line as one of the 220 languages it knows, and writes the results, one ISO 639-3 code per line, into the file <outfile>.

Relation

Related Resource: HeLI-OTS 1.5 http://urn.fi/urn:nb...

Relation Type: IsNewVersionOf

Documentation

Resource group page: http://urn.fi/urn:nb...
