LESSON - Company Directory Information and Customer Records: The Schibsted Case
By Per Anders Waaler, Senior Software Architect, and Davor Sutija, Vice President, Strategic Market Development, FAST
All companies of a certain size or complexity face the same challenge: vital information about the business is stored in multiple locations and in different formats. This applies to customer data as well as product data. Arriving at a single version of the truth about a particular customer's earnings or a particular product group takes considerable effort. Enter the data warehouse.
Data warehouses, however, leave one challenge unsolved: merging data from disparate sources that share no common keys. Customer data is typically stored in several systems: CRM, ERP, call centers, and so on. For a company that has recently undergone a merger or acquisition, the challenge only grows.
Schibsted Search and Sesam.no
Imagine the challenges involved in being an Internet directory provider. In Norway, the home market of Schibsted, a large publishing company, there are more than 20 different providers of subscription services. In this market, all telecommunications operators are obligated by law to deliver, at their own cost, a list of subscribers to third parties that request such information. This legal requirement, however, neither mentions nor guarantees data quality. Because Schibsted is a publishing company, not a telecom operator, all such telecom record sources were external to it. As a result, Sesam.no, Schibsted’s online local search and yellow pages/white pages service, probably faced even more challenges than most in merging its databases.
Schibsted Search’s goal was to present on the Web (at Sesam.no) all listings for a given company or person in a single GUI block, combining and cleansing multiple records. But even when the underlying telecom providers had the correct names for their subscribers, abbreviations and contractions abounded. For example, our own company, officially registered as Fast Search & Transfer, does business under the trade name FAST. Similar inconsistencies are especially prevalent among restaurants, retailers, and other businesses whose public trademark is not the name under which the company was registered.
By using multiple secondary and tertiary data fields, such as addresses and phone numbers, the link between the trade name and the corporate identity can be established. In addition, spelling errors and variations in word order that occur during data entry in call centers or online can be corrected automatically before publication using analogous tools.
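As a minimal illustration of this linkage, here is a Python sketch. The record fields and the `link_by_secondary_fields` helper are hypothetical stand-ins for what the actual cleansing tooling does with addresses and phone numbers; they are not part of the FAST product.

```python
# Hypothetical master records; field names are illustrative, not FAST's schema.
registered = [
    {"name": "Fast Search & Transfer", "phone": "+47 21 00 00 00", "zip": "0250"},
    {"name": "Oslo Kaffebar AS", "phone": "+47 22 11 22 33", "zip": "0555"},
]

def normalize_phone(phone):
    """Keep digits only so formatting differences don't block a match."""
    return "".join(ch for ch in phone if ch.isdigit())

def link_by_secondary_fields(listing, master):
    """Link a trade-name listing to a registered entity even though the
    primary names differ, using phone and zip code as tie-breakers."""
    for rec in master:
        if (normalize_phone(rec["phone"]) == normalize_phone(listing["phone"])
                and rec["zip"] == listing["zip"]):
            return rec["name"]
    return None

# A directory listing filed under the trade name "FAST":
listing = {"name": "FAST", "phone": "+47 21000000", "zip": "0250"}
print(link_by_secondary_fields(listing, registered))
```

Even this toy version shows the principle: the name fields disagree completely, yet the secondary fields identify the two records as the same entity.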
“The answer to our data quality issues, not only in data warehousing, but in all other aspects, is to build solutions using new tools such as the FAST Data Cleansing Solution to solve that issue. Regular database and programming logic could not do it for us.”
Mikal Rohde, Schibsted Search
Schibsted had already standardized on FAST as its search platform on the end-user side of the Web, offloading database technologies that could not cope with the growing traffic and ad hoc query load an online enterprise generates, both in number of users and in query complexity.
When we told Schibsted that they could also use the same technology for data cleansing, they were intrigued. As part of the FAST Data Cleansing Solution, adapters expose the FAST search engine as a database within leading ETL tools such as BusinessObjects Data Integrator and Informatica (when the complexity of the source data requires such a tool). The solution also includes embedded tools that can access popular RDBMS systems directly. In addition, FAST Data Cleansing can be called through custom APIs from any Java-enabled tool (TIBCO, webMethods, MQSeries, Oracle, and so on).
With this solution in place, we populated the FAST index from an arbitrarily large set of data sources, both structured and unstructured. Schibsted was interested in matching relevant directory listings to news stories and products using FAST’s unique relevancy ranking tools.
The steps in the cleansing operation are straightforward: a data source is first indexed in FAST, and then additional records/sources are matched one-to-many using lookups in the FAST index. This process not only executes strict matches obtained by running queries against a database, but more broadly, uses various fuzzy matching algorithms. This approach, which reduces words to their core and then matches all forms of that core word, delivers comparative results ranked according to relevance and user-defined rule sets. All of this can be done easily within an ETL tool framework, using the adapter or specialized cleansing GUIs.
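The core-word matching described above can be sketched in plain Python. This is only an approximation of the idea: the `core` reduction and the token-overlap score below are simplified, assumed stand-ins for the engine's lemmatization and relevance ranking, not the actual algorithms.

```python
import re

def core(word):
    """Reduce a word to a crude 'core' form: lowercase, strip punctuation,
    and drop a few common endings. A real engine uses proper lemmatization."""
    w = re.sub(r"[^a-z0-9]", "", word.lower())
    for suffix in ("ene", "er", "en", "et", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

def tokens(text):
    return {core(t) for t in text.split() if core(t)}

def ranked_matches(query, records, min_score=0.5):
    """Score each record by token-core overlap with the query and
    return candidates above the threshold, best first."""
    q = tokens(query)
    scored = []
    for rec in records:
        r = tokens(rec)
        overlap = len(q & r) / max(len(q | r), 1)
        if overlap >= min_score:
            scored.append((overlap, rec))
    return [rec for score, rec in sorted(scored, reverse=True)]

index = ["Restaurantene i Oslo AS", "Oslo Restaurant AS", "Bergen Bakeri"]
print(ranked_matches("restaurant oslo", index))
```

Because both sides are reduced to core forms, the inflected listing “Restaurantene i Oslo AS” still matches the query “restaurant oslo”, and candidates come back ranked rather than as a bare yes/no.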
Data Cleansing Operation
- Read sources
- Track source changes
- Cleanse new/changed data against master
- Store result in database
- Allow for manual inspection GUI for unclear matches
- Update master and searchable apps with cleansed info
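The bulleted workflow above can be sketched as a loop that routes each incoming record by match confidence. The thresholds, the `best_match` helper (built here on Python's difflib rather than the search index), and the manual-review queue are all illustrative assumptions, not the product's API.

```python
from difflib import SequenceMatcher

def best_match(record, master):
    """Naive similarity lookup; a real deployment queries the search
    index with fuzzy matching instead of comparing strings directly."""
    if not master:
        return None, 0.0
    scored = [(SequenceMatcher(None, record.lower(), m.lower()).ratio(), m)
              for m in master]
    score, match = max(scored)
    return match, score

def cleanse(sources, master, lookup, high=0.9, low=0.6):
    """Route each new/changed record: confident matches are stored,
    borderline ones are queued for manual inspection, and misses
    become new master entries."""
    stored, review = [], []
    for record in sources:                 # read sources
        candidate, score = lookup(record, master)
        if score >= high:                  # confident match: store it
            stored.append((record, candidate))
        elif score >= low:                 # unclear: hand to a human
            review.append((record, candidate))
        else:                              # no match: extend the master
            master.append(record)
    return stored, review, master

master = ["bWise as", "Fast Search & Transfer"]
stored, review, master = cleanse(["beWise as", "Acme Rockets"], master, best_match)
```

The two thresholds are the important design choice: a single cutoff forces every borderline record into a wrong bucket, whereas the band between `low` and `high` is exactly what feeds the manual-inspection GUI mentioned in the list above.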
There are two basic scenarios that organizations face when attempting to merge data from multiple sources. In the first scenario, there is a master to cleanse against. This is the easiest case, but not the most common. Here, we populate the FAST index with the master and then cleanse the other sources against it, expecting a clear candidate for each match.
Scenario two is more typical. Here, there is no clear master, but instead a group of sources (such as a set of customer databases) that you want to cleanse against each other. The purpose is to find the same entity across disparate sources. In this scenario, we populate the master with one of the databases (chosen arbitrarily) and then cleanse that same set against itself to eliminate duplicates. Once any duplicates are removed, the cleansed set of database records acts as a master for further cleansing against the other repositories.
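Scenario two's first step, cleansing a source against itself, can be sketched as follows. The `similar` function uses difflib as a crude stand-in for the engine's fuzzy lookup, and the 0.85 threshold is an assumed value.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.85):
    """Crude string similarity standing in for the engine's fuzzy lookup."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def dedupe(records, threshold=0.85):
    """Scenario two, step one: cleanse a source against itself.
    A record is kept only if no earlier-kept record already matches it."""
    master = []
    for rec in records:
        if not any(similar(rec, kept, threshold) for kept in master):
            master.append(rec)
    return master

# One arbitrarily chosen source, containing an internal duplicate:
source_a = ["bWise as", "beWise as", "Oslo Kaffebar AS"]
master = dedupe(source_a)
# The deduplicated set then acts as the master for cleansing the
# remaining repositories (the other customer databases) against.
```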
Cleansing Company Information
A typical scenario at Sesam.no was the cleansing of company information, accomplished using scenario one as described above. Since Norway maintains a complete, regularly updated register of all companies, that register can be used as the master to cleanse against. In the Oslo region alone, a metropolitan area of one million people, there are some 60,000 registered companies. Cleansing in most cases is done on a combination of company name, street, zip code, and municipality.
Consider the sample case described below. With a pool of 60,000 companies, given only a misspelled company name and the city, we are able to find the correct entity. The solution handles other typical quality issues equally well, such as words switching place (first/last name versus last/first name), synonyms, and lemmatization (recognizing and matching different grammatical forms).
Company name: beWise as (this is actually a misspelled version of bWise as)
Street: Not provided
Zip: Not provided
Having populated the master index with all 60,000 companies in Oslo, the fuzzy match returns a (partial) candidate list ranked by relevance.
The cleansed flag is set based on thresholds given to the fuzzy lookup function.
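A rough sketch of such a thresholded fuzzy lookup follows, with difflib standing in for the FAST index. The record fields, the candidate list's shape, and the 0.9 threshold are all assumptions made for illustration; only the idea of flagging a record as cleansed when the best score clears a configured threshold comes from the case itself.

```python
from difflib import SequenceMatcher

# Illustrative master records; the real index holds ~60,000 Oslo
# companies and is queried through the search engine, not difflib.
master = [
    {"name": "bWise as", "city": "Oslo"},
    {"name": "Wise Consulting AS", "city": "Oslo"},
    {"name": "beWell Helse AS", "city": "Bergen"},
]

def fuzzy_lookup(name, city, master, threshold=0.9):
    """Return candidates ranked by similarity. The 'cleansed' flag is
    set only when a candidate's score clears the given threshold."""
    candidates = []
    for rec in master:
        if rec["city"].lower() != city.lower():
            continue  # hard filter on city before fuzzy-matching the name
        score = SequenceMatcher(None, name.lower(), rec["name"].lower()).ratio()
        candidates.append({"match": rec["name"], "score": round(score, 3),
                           "cleansed": score >= threshold})
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

# The misspelled sample case: "beWise as", city Oslo, no street or zip.
results = fuzzy_lookup("beWise as", "Oslo", master)
```

Running this returns “bWise as” as the top-ranked, cleansed candidate despite the extra letter, while weaker lookalikes remain in the list unflagged, which is what makes the threshold the natural knob for routing records to automatic acceptance or manual review.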
When Schibsted and Sesam.no were awarded the Norwegian Data Warehouse Award for 2006, Mikal Rohde, CEO of Schibsted Search, said: “The answer to our data quality issues, not only in data warehousing, but in all other aspects, is to build solutions using new tools such as the FAST Data Cleansing Solution to solve that issue. Regular database and programming logic could not do it for us.”
The jury of the Data Warehouse Conference, hosted by Confex Norway, stated in their evaluation that the differentiating factor for the winning entry was their solution to the problem of data quality, which allowed their data warehouse to drive their entire online business.
This article originally appeared in the issue of .