New Off-the-Shelf Data Sets from Appen Available for AI Training

Data sets include scripted speech, images with text, body movement, and human audio.

Note: TDWI’s editors carefully choose vendor-issued press releases about new or upgraded products and services. We have edited and/or condensed this release to highlight key features but make no claims as to the accuracy of the vendor's statements.

Appen Ltd., a provider of training data for organizations that build AI systems, released new off-the-shelf (OTS) data sets. These data sets are designed to make it easier and faster for businesses to acquire the training data needed to accelerate their artificial intelligence (AI) and machine learning (ML) projects.

The new OTS data sets include human body movement and baby crying sounds, as well as scripted speech and images with text suitable for optical character recognition (OCR) in high-demand but hard-to-acquire languages such as Arabic, Croatian, Greek, Hungarian, and Thai. With the expanded data sets, Appen’s total OTS offering includes over 250 data sets, comprising over 11,000 hours of audio, over 25,000 images, and over 8.7 million words across 80 languages and multiple dialects.

Teams expanding their AI capabilities can also use OTS data sets to improve accuracy, develop new model skills, and incorporate other improvements into their AI models. For example, an OTS data set is often delivered in one week, compared with eight to 12 weeks (or longer, depending on complexity) for a new data set collection and annotation project. According to the company, all Appen data sets are developed using a fully transparent, opt-in methodology, so AI specialists can be assured their data is clean and compliant, reducing the risk of backlash and reputation damage.

“AI teams around the world working on projects with tight deadlines and flexible data requirements can benefit from using off-the-shelf data sets,” said Wilson Pang, CTO of Appen. “OTS data sets shorten time to value and provide access to high-quality data at a lower total cost than using traditional methods. We at Appen take the necessary steps to ensure that all our data sets are ethically sourced and demographically balanced, enabling companies to maintain responsible AI practices by minimizing bias in their models and ensuring fair treatment of data annotators. You always know the precise quality of an OTS data set, which helps build better AI that works in the real world.”

Joining the hundreds of data sets already live on, the list of new Appen OTS data sets that are now available includes:

  • Scripted speech for Arabic (Egypt), Arabic (Saudi Arabia), Arabic (United Arab Emirates), Central Khmer (Cambodia), Croatian, Greek, Hungarian, Polish, Spanish (Spain), and Turkish
  • Image OCR for Simplified Chinese printed text, Thai printed text, and Finnish printed text, including images of billboards, outer packaging, signs, magazines, and menus to train and update computer vision OCR models
  • Human body movement (China), including annotated videos of people moving, tracked at pixel level, suitable for game development, fitness apps, and more
  • Baby crying audio (China), including pre-recorded and annotated baby sounds that can be used to train AI models to recognize different crying sounds and alert parents

For details, visit
