Amazon Research Introduces MT-GenEval: A New Benchmark for Evaluating Gender Bias in Machine Translation

It has long been a goal in computer science to develop software capable of translating written text between languages. In the last decade, machine translation has become a practical and widely used productivity tool. As these systems grow in popularity, it becomes increasingly important to verify that they are objective, fair, and truthful.

Assessing machine translation systems jointly for gender accuracy and translation quality is challenging, because existing benchmarks lack diversity in gender phenomena (e.g., focusing only on occupations), in sentence structure (e.g., using templates to generate sentences), or in language coverage.

To that end, new work from Amazon introduces MT-GenEval, a benchmark for assessing gender bias in machine translation. Comprehensive and realistic, the MT-GenEval evaluation set covers translation from English into eight widely used (but sometimes underexplored) languages: Arabic, French, German, Hindi, Italian, Portuguese, Russian, and Spanish. The benchmark provides 2,400 parallel sentences for training and development and 1,150 evaluation segments per language pair.


MT-GenEval is gender-balanced thanks to the inclusion of human-created gender counterfactuals, which give it realism and variety, and it covers a wide range of disambiguation settings.

Test sets of this kind are generally generated artificially, which introduces strong biases. In contrast, MT-GenEval data is based on real-world Wikipedia text and includes professionally produced reference translations in each language.

Understanding how gender is expressed in different languages helps identify where translations commonly fail. Some English words, such as “she” (female) or “brother” (male), leave no ambiguity about the gender they express. In many languages, including those covered by MT-GenEval, nouns, adjectives, verbs, and other parts of speech are marked for gender.


When translating from a language with limited gender marking (like English) into a language with extensive gender marking (like Spanish), a machine translation model must not only translate the words but also accurately express the gender of words that are genderless in the input.

In practice, however, input texts are seldom that simple, and the word that makes a person’s gender unambiguous can be quite distant, perhaps even in a different sentence, from the words that must express that gender in the translation. The researchers found that, when faced with such ambiguity, machine translation models tend to fall back on gender biases (e.g., translating “beautiful” as feminine and “handsome” as masculine, regardless of context).

While there have been isolated reports of translations that did not accurately reflect the intended gender, there has been no way to quantitatively measure these failures on real, complex input text.

The researchers searched English Wikipedia articles for candidate text segments containing at least one gender-specific word within a three-sentence window. To ensure the segments were useful for assessing gender accuracy, human annotators removed all sentences that did not specifically refer to people.
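To make the mining step concrete, here is a minimal sketch of how such candidate segments could be extracted. The word list and helper names are illustrative assumptions, not the authors' actual lexicon or code:

```python
# Sketch of the candidate-mining step: keep three-sentence windows that
# contain at least one gender-specific word. GENDERED_WORDS is a small
# illustrative sample, not the full lexicon used by the authors.
GENDERED_WORDS = {"he", "she", "him", "her", "his", "hers",
                  "brother", "sister", "actor", "actress"}

def mine_candidates(sentences, window=3):
    """Yield every run of `window` consecutive sentences in which at
    least one sentence contains a gender-specific word."""
    for i in range(len(sentences) - window + 1):
        segment = sentences[i:i + window]
        tokens = {w.strip(".,").lower() for s in segment for w in s.split()}
        if tokens & GENDERED_WORDS:
            yield segment
```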

The annotators then created counterfactuals for these segments, changing the gender of the participants from female to male or from male to female, to ensure gender balance in the test set.

Each segment in the test set has both a correct translation with the intended genders and a contrastive translation that differs from the correct one only in its gendered words, allowing gender translation accuracy to be assessed. The study introduces a simple accuracy measure: given a translation with a desired gender, it considers all the gendered words in the contrastive reference; the translation is flagged as inaccurate if it contains any of those gendered words, and counted as correct otherwise. The authors report that this automatic metric agrees reasonably well with human annotators, with F-scores above 80% in each of the eight target languages.
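The check itself is straightforward to implement. Below is a minimal sketch, assuming the gendered words can be approximated as the tokens that appear in the contrastive reference but not in the correct one; the function names and this word-matching granularity are assumptions, not the paper's exact implementation:

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase a sentence and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def is_gender_accurate(hypothesis: str, correct_ref: str, contrastive_ref: str) -> bool:
    """Return True if the hypothesis contains none of the gendered words
    that appear in the contrastive reference but not in the correct one."""
    wrong_gender_tokens = tokenize(contrastive_ref) - tokenize(correct_ref)
    return not (tokenize(hypothesis) & wrong_gender_tokens)

# Example in Spanish, where the desired gender is feminine.
correct = "La médica es alta."       # correct (feminine) reference
contrastive = "El médico es alto."   # same sentence with masculine forms
print(is_gender_accurate("La médica es alta.", correct, contrastive))  # True
print(is_gender_accurate("El médico es alto.", correct, contrastive))  # False
```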


In addition to this accuracy metric, the team also proposes a metric for comparing machine translation quality between masculine and feminine outputs. These gender quality differences are measured by comparing the BLEU scores of the masculine and feminine halves of the same gender-balanced dataset.
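A minimal sketch of such a comparison is shown below, assuming the sacrebleu library and a single reference per segment; the function name and sign convention are illustrative assumptions:

```python
import sacrebleu

def gender_quality_gap(hyps_masc, refs_masc, hyps_fem, refs_fem):
    """Compute corpus BLEU separately on the masculine and feminine halves
    of a gender-balanced test set and return the difference in points.
    A positive gap means masculine segments are translated better."""
    bleu_masc = sacrebleu.corpus_bleu(hyps_masc, [refs_masc]).score
    bleu_fem = sacrebleu.corpus_bleu(hyps_fem, [refs_fem]).score
    return bleu_masc - bleu_fem
```

Because the two halves are counterfactual versions of the same segments, any BLEU difference can be attributed to how the system handles gender rather than to differences in sentence content.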

MT-GenEval is a significant improvement over previous methods for assessing the gender accuracy of machine translation, thanks to its extensive curation and annotation. The team hopes their work will encourage other researchers to focus on improving gender translation accuracy for complicated real-world inputs across languages.


Check out the paper and the Amazon blog post. All credit for this research goes to the researchers on this project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.


Tanushree Shenwai is a Consulting Intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a data science enthusiast with a keen interest in the application of artificial intelligence across various fields. She is passionate about exploring new technological advances and their real-life applications.

