Slicing & dicing content quality methods into convenient bite-sized pieces

4 ways for global content experts to slice & dice quality

A tool for slicing & dicing content quality methods into convenient bite-sized pieces

Today, we are venturing into more technical territory than usual. This article will likely be most interesting to globalization experts, as well as to those specifically looking to deepen their knowledge of applied global content quality management. What happens to quality evaluations, and how do they transform, as we get into the nitty-gritty of multilingual content quality management frameworks? How do we make sense of these frameworks and their interrelations? And how can we come back to the level where it all makes sense to people uninitiated in localization quality matters (that is, to 99.9% of the world population), and talk to them in their language?

Over the years, the localization industry has come up with several well-structured methodologies to define, categorize, and measure various aspects of quality in multilingual translated content. Notable frameworks in this space include the TAUS Dynamic Quality Framework (DQF), Multidimensional Quality Metrics (MQM), and the Logrus Quality Triangle. However, it’s easy to lose sight of the forest for the trees. How do those frameworks fit together? How do established localization industry processes relate to them? And what connection does all of the above have to the Ultimate Content Quality Question: does our global content actually impact the desired business KPIs?

Here is a method we might use to further structure and refine our thinking about some of the global content quality management approaches, processes, methods, and techniques. It relies on 4 categories:

  1. Contextuality*: bilingual versus monolingual
  2. Technology: human versus machine
  3. Expertise: untrained versus professional
  4. Granularity: atomistic versus holistic

Note that these categories are not mutually exclusive: each quality method or technique belongs to all 4 categories at once. The only question is where exactly it sits within each category. If we imagine each category as a horizontal scale, is a given method closer to the left end or to the right end? You’ll find details & examples below.

Those of us who are mathematically inclined may want to think of each category as an individual dimension in a 4-dimensional space. A four-coordinate vector (c, t, e, g) will then represent the position of a technique or method within that space.
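To make that aside concrete, here is a minimal Python sketch. All method names, coordinate values, and scale orientations below are hypothetical illustrations invented for this post, not measurements from any framework:

```python
# Hypothetical sketch: each quality method as a point in the
# 4-dimensional category space (c, t, e, g). The scale orientation
# (which end is 0.0 and which is 1.0) is an arbitrary choice here.

from typing import NamedTuple

class QualityMethod(NamedTuple):
    name: str
    contextuality: float  # 0.0 = monolingual, 1.0 = bilingual
    technology: float     # 0.0 = human,       1.0 = machine
    expertise: float      # 0.0 = untrained,   1.0 = professional
    granularity: float    # 0.0 = atomistic,   1.0 = holistic

methods = [
    QualityMethod("Proofreading",         0.0, 0.0, 0.9, 0.1),
    QualityMethod("BLEU scoring",         1.0, 1.0, 0.0, 0.2),
    QualityMethod("Crowdsourced ratings", 0.3, 0.0, 0.1, 0.9),
]

for m in methods:
    print(f"{m.name}: (c={m.contextuality}, t={m.technology}, "
          f"e={m.expertise}, g={m.granularity})")
```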

Normal people, on the other hand, should just read on. I promise it will all make sense eventually 🙂

Contextuality*: Bilingual versus Monolingual

*OK, I admit: I’ve just made this word up. My text editor underlines it in bright red as I’m typing this article. I’d welcome your suggestions on what to call it.

This spectrum is about the volume and nature of information that’s taken into account when making a judgment about content quality.

On the monolingual end of the spectrum, we consider just the content itself, in the language it appears in at the moment of evaluation. Proofreading for spelling & grammar mistakes is a good example of a monolingual process.

On the bilingual end of the spectrum, we can also refer to the original version of the content in its native language (the all-mighty source!) and are able to compare the translation with the source. Editing for translation accuracy is a good example of a bilingual process.
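To illustrate the difference with a toy sketch (both checks below are deliberately simplistic and invented for this post): a monolingual check needs only the content itself, while a bilingual check also consults the source.

```python
# Toy illustration: a monolingual check sees only the target text,
# while a bilingual check can compare the target against the source.

def monolingual_check(target: str) -> list[str]:
    """Flag double spaces, judged from the target text alone."""
    return ["double space found"] if "  " in target else []

def bilingual_check(source: str, target: str) -> list[str]:
    """Flag digits present in the source but missing from the target,
    a judgment that is impossible without access to the source."""
    missing = [ch for ch in source if ch.isdigit() and ch not in target]
    return [f"number '{d}' missing from translation" for d in missing]

print(monolingual_check("Save  the file"))        # ['double space found']
print(bilingual_check("Chapter 7", "Kapitel 9"))  # 7 missing from target
```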

Technology: Human versus Machine

This spectrum is, surprisingly, exactly what it sounds like.

Some types of content quality evaluations or assessments are produced by actual people working with your content (for instance, stylistic copyediting, revision, end user feedback, or usability testing).

Others are produced by software that analyzes your content (for instance, MT quality metrics, automatic translation quality checks, readability statistics, or sentiment analysis).

There are also certain quality procedures that may be performed either way, with varying efficiency, reliability, and costs (for example, manual vs automated software localization testing or manual vs automated spelling and grammar checks).
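As a bare-bones, dependency-free sketch of the machine end of one such procedure, here is an automated spelling check against a tiny, hypothetical word list (real tools use full dictionaries and morphology, of course):

```python
# Hypothetical sketch of an automated spelling check. A human
# proofreader performs the same procedure manually; real checkers
# use full dictionaries rather than a five-word set.

VOCABULARY = {"the", "quick", "brown", "fox", "jumps"}

def automated_spell_check(text: str) -> list[str]:
    """Return words not found in the vocabulary."""
    return [w for w in text.lower().split() if w not in VOCABULARY]

print(automated_spell_check("The quikc brown fox jumps"))  # ['quikc']
```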

Expertise: Untrained versus Professional

This spectrum is applicable mostly to the human end of the above Technology spectrum. However, with some creativity one can find a way to apply it to machines as well. For the mathematically inclined among us, let’s agree we’re leaving this as an exercise for the reader 🙂

Here, on one side, we have methods relying on dedicated, educated, well-trained evaluators (for example, Error Typology reviews, where professional linguists, typically after undergoing extra training, classify atomistic issues according to some metric; a small scoring sketch follows at the end of this subsection).

On the other side, we find approaches that rely on evaluators without any particular expertise or training (for example, crowdsourced quality evaluation methods or votes, as well as acceptance testing techniques and end-user feedback). In a global setting, it’s usually implied that those individuals possess the relevant language skills.
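To make the professional end a bit more concrete, here is a hedged sketch of how an Error Typology review might be scored. The severity weights, category names, and per-1000-words normalization below are common industry conventions rather than a standard prescribed by any of the frameworks mentioned here:

```python
# Hypothetical scoring sketch for an Error Typology review:
# weighted penalty points, normalized per 1000 words.

errors = [  # (category, severity) pairs logged by a trained reviewer
    ("Terminology", "major"),
    ("Grammar", "minor"),
    ("Grammar", "minor"),
    ("Accuracy", "critical"),
]
WEIGHTS = {"minor": 1, "major": 5, "critical": 10}
WORD_COUNT = 800  # size of the reviewed sample

penalty = sum(WEIGHTS[severity] for _, severity in errors)
score = penalty / WORD_COUNT * 1000  # penalty points per 1000 words
print(f"Penalty points per 1000 words: {score:.1f}")
```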

Granularity: Atomistic versus Holistic

This spectrum is the most developed and the most popular in the global content quality management domain. Feel free to skip to the Content Performance subsection if you are very familiar with the atomistic vs holistic dichotomy.

Atomistic Quality

On the atomistic end of this spectrum, we operate at the microscopic level of “content atoms”: the individual sentences, words, and characters that make up a piece of content in a particular language.

  • The negative impact of any quality issues on this level is usually limited to the confines of the sentence (or, at most, the paragraph) where the issue has occurred.
    • An important exception to the above rule is showstopper issues, best captured by the Logrus Quality Square model. Showstopper errors actually impact the holistic level (see below), despite being atomistic by nature.
  • Example process steps operating mostly on this level:
    • Language Quality Inspection/Error Typology reviews (e.g. using MQM taxonomy, DQF, older models like LISA QA Model, or arbitrary translation error taxonomies)
    • proofreading for spelling, grammar, and style
    • editing (certain types)
    • software localization testing (in many occasions)
    • Machine Translation quality metrics, e.g. BLEU and METEOR (a small BLEU sketch follows this list)
    • DTP QA
  • How people might talk about this level:
    • “This translated term doesn’t fit the context of this sentence.”
    • “This comma is not needed here.”
    • “Completely ignored the grammar. Should be future tense, not the past.”
    • “The word A was mistranslated as B.”
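For the curious, here is a minimal sentence-level BLEU example using NLTK. The toolkit choice and the toy sentences are mine, not something the metrics’ authors prescribe:

```python
# Minimal BLEU sketch: compares candidate n-grams against reference
# n-grams, an atomistic judgment made on "content atoms".

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "cat", "sits", "on", "the", "mat"]
candidate = ["the", "cat", "sat", "on", "the", "mat"]

score = sentence_bleu(
    [reference],  # one or more reference translations
    candidate,
    smoothing_function=SmoothingFunction().method1,  # avoid zero scores
)
print(f"BLEU: {score:.3f}")
```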

Holistic Quality

On the spectrum’s opposite, holistic end, we operate with the overall perceptions and impressions that a piece of content as a whole leaves on the end reader or end user.

  • The holistic level relates to people acquiring desired knowledge, performing desired actions, or changing their attitudes in the desired way after coming into contact with your content.
  • In other words, this is all about user and reader experience, not the “atoms” that make it up. Think of it as a total that exceeds the sum of its individual parts.
  • Example process steps operating mostly on this level:
    • Accuracy (Adequacy) and Fluency reviews of the entire text (as opposed to individual sentences)
    • ratings (e.g. 1-5 stars; a small aggregation sketch follows this list)
    • end user feedback (in some cases)
    • in-country review (in some cases)
    • usability testing
    • “overall feedback” sections of Error Typology reviews
  • How people might talk about this level:
    • “This doesn’t sound like a native speaker.”
    • “Was this translated by a robot or something?”
    • “I don’t understand the point they are trying to make”
    • “So many errors I couldn’t find the right button to click and deleted the app in disgust”
    • “It. Just. Doesn’t. Work.”
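As a tiny sketch (with invented data) of how holistic star ratings might be aggregated per language:

```python
# Hypothetical aggregation of holistic 1-5 star ratings per language.

from collections import defaultdict
from statistics import mean

ratings = [("de", 4), ("de", 5), ("de", 2), ("fr", 3), ("fr", 4)]

by_language: dict[str, list[int]] = defaultdict(list)
for lang, stars in ratings:
    by_language[lang].append(stars)

for lang, stars in sorted(by_language.items()):
    print(f"{lang}: mean {mean(stars):.2f} over {len(stars)} ratings")
```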

Content Performance: the Pinnacle of Holistic Quality Evaluations

Out of all holistic metrics, some are actually “more holistic” than others. Content performance metrics evaluate the overall, ultimate success of global content in any given language. They provide a sense of whether the content has actually achieved the desired outcomes for the business that commissioned it, as well as for the people who have consumed or interacted with it. Examples include conversion rates, customer satisfaction, learning outcomes, Return on Investment, and many others.
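To illustrate with one such metric, here is a hypothetical computation of per-language conversion rates; all figures are invented purely for the example:

```python
# Hypothetical content performance metric: conversion rate per language.

funnel = {
    "en": {"visits": 12_000, "conversions": 540},
    "de": {"visits": 4_000, "conversions": 120},
    "ja": {"visits": 2_500, "conversions": 30},
}

for lang, f in funnel.items():
    rate = f["conversions"] / f["visits"] * 100
    print(f"{lang}: {rate:.1f}% conversion rate")
```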

Content performance metrics are, thus, among the most important ones to measure wherever technically possible and practically feasible. Yet they are rarely made available to the entire global content supply chain, which lowers transparency significantly. This opacity should be very concerning to all managers: the very people whose work strongly drives those key metrics (for instance, individual authors and linguists) often don’t have any access to this powerful form of feedback.

Note: While quality factors (such as atomistic and holistic quality) are not the only ones influencing content performance, they obviously play an important role in it and are best viewed in the same context.


Hopefully, this overview was helpful in structuring some of your thoughts around the many methods in the localization quality evaluation toolkit. What made a lot of sense to you, and what didn’t? What do you agree with? What do you find controversial? We’d LOVE to read your comments!

Published by

Kirill Soloviev

Co-Founder & Head of Product at ContentQuo

  • Артём Ё

    As for contextuality… First of all, it may be more than “bilingual”. If we have a chain of languages in the localization process (RU => EN => whatever), then for the “whatever” target language we can take into account not only the direct English source, but also the Russian pre-source, so to speak.
    Furthermore, when evaluating the quality of, let’s say, Frisian localized content, we can sometimes look at the corresponding content localized into other languages and try to understand whether there is a localization error, or whether we are dealing with a more complicated issue, e.g. a new term in the source content that has no optimal (conventional) translation yet, not only into Frisian but also into other languages.
    I won’t try to say how to organize all that 🙂 But at least when working with glossaries (not actually user-oriented content), I personally often try to keep the multilingual perspective of a term in mind.

    • Артём Ё

      Now, about a more interesting thing: spreading this category (“contextuality”) not only in a quantitative manner (number of languages), but also in a qualitative manner. “Unilateral vs. multilateral”?
      What I mean is that we can take different types of information (materials) into account. When evaluating the quality of UA doc, we should compare the names of GUI elements with the corresponding GUI resources. And it’s not always as trivial as “any deviation of UA doc from GUI resources is an error”. Leaving aside typos in GUI resources, there is a more interesting case: if the same source GUI element is localized in different ways in the GUI resources and in the UA doc, what if the option in the UA doc is more adequate and authentic than the option in the GUI resources? Then the choice of what to correct for the sake of overall content (product) quality can depend on the workflow, etc.
      I hope you get what I’m trying to say and, ideally, that it relates more or less to the “contextuality” you invented 🙂 Pls. let me know what you think.

      • Good thinking. Language is indeed only 1 dimension of the Contextuality space (although probably the most basic one), and your suggestions to extend it to more dimensions are indeed quite fitting. Now, if we only had a simple way to depict a 5-dimensional space on a flat screen… 🙂

  • Артём Ё

    Let me also touch upon another category – “Expertise”, since you’ve presented it as “an exercise for the reader”. BTW what about a prize? 🙂

    Terminology verification can be automatic, at least partly (the “machine end of the Technology spectrum”, in the context of this blog).
    In my work (localization of IT security software), I often apply a common glossary (of IT terms) and a product-specific glossary (of terms related to specific product elements, to third-party software compatible with the product, etc.)

    I have a daring idea: to relate the common glossary to your “Untrained” and the product-specific glossary to your “Professional”. I think it’s more or less clear, so I’ll give just 1 criterion:

    the common glossary is a kind of compilation from many public sources, roughly speaking (although in reality its management is much harder and more manifold, believe me…)

    vs.

    the product-specific glossary must take into account architectural/functional information about the product, marketing factors, GUI design, etc.

    P.S. My opposition holds on the “human end of the Technology spectrum” as well: to create a good product-specific glossary, you must be a real expert in the corresponding product. Meanwhile, practically all of us (at least those who read this blog) are “IT experts” in the modern world. Well, I’m not talking about terminology management expertise right now. Anyway, this P.S. doesn’t contradict my attempted solution to the exercise.

    • Artyom, a prize is indeed most appropriate! In appreciation of your continued contributions we’ve temporarily changed the title of this blog post and included your name as an honorable mention. Just scroll up and you’ll see it.
