Three Ways to Quickly Ballpark a Data Warehouse Build Schedule
When a seemingly impossible question comes down from "on high," you can't always say no. Here are a few simple approaches to providing an off-the-cuff delivery estimate for a new data warehousing project, even with limited information and a tight time frame.
- By Cass Brewer
- April 21, 2016
It's Wednesday morning and you have mail:
Hey there! I'm pitching a data warehouse to the Technology Planning Council on Monday and I need a ballpark figure for how fast we can get that in place. Sheila says you're the go-to guy for database questions. Can you get that to me by Friday? -- Syd, VP of your world
This is good news, right? You're a big fan of data warehousing. On the other hand, can you actually whip up a realistic time estimate in just a couple of days? Short of Googling "data warehouse typical schedule," where do you even start?
Option 1: The "Nice Work If You Can Get It" Estimate
Conventional project estimation favors both top-down (estimating how long the big steps will take) and bottom-up (compositing tight timelines from small work units) approaches. However, both approaches typically require time and research. They might not be practical in a pinch unless you are lucky or skilled enough to have a big head start.
If you have access to records from comparable projects, you can try to pull together an analogous estimate. Look for technology projects in your company's history that have had comparable complexity, uncertainty, resource requirements, and impact. Hopefully, you'll be able to find some activity and scheduling records that can springboard your own estimate.
Alternately, if you have a lot of experience with similar projects in other companies (or know someone who does), you might tap your own judgment to broadly estimate the time needed for various steps, such as investigation, design, development, ETL, testing, and deployment.
In either case, you can use internally available or published templates and checklists to quickly identify and sequence the major components of a "typical" data warehouse implementation. A high-level outline provides a baseline for analogous estimates or a list of building blocks for judgment-based estimation. Here are a few published examples:
Option 2: The "Quick and Dirty Data" Estimate
This approach is more operationally based in that it requires only good working knowledge of the technical environment and focuses on build time.
First, arbitrarily rate the complexity of each source data system (including external feeds and spreadsheets) as low, medium, or high based on:
- Number of fact tables and dimensions
- Volume of data and rate of change
- Associated reporting requirements
Score each "low" system as 3, each "medium" system as 9, and each "complex" system as 15. These scores represent worker-days required to include the system in a data warehouse, accounting for design, ETL, and testing. You can adjust these numbers if you have a different sense of the actual effort required.
Systems that are poorly understood and external feeds (which might not be transparent) likely deserve higher scores. Sum all scores into a single number.
Next, add a margin for unknowns. Divide your total days by number of systems. This will give you an average day count per system. If the average is less than 6, multiply the total by .2. If it's between 6 and 12, multiply by .3. If it's over 12, multiply by .4. These fractional numbers represent uncertainty in your estimate, based on system complexity.
Add the product of your uncertainty calculation to the original day count to get a total time estimate that assumes one full-time resource. To reflect work weeks or additional resources, you can either divide the total by 5 or the number of resources, respectively.
Example:
Source |
Score |
Calculation |
System 1 |
3 |
|
System 2 |
15 |
|
System 3 |
15 |
|
System 4 |
9 |
|
Total Score |
39 |
=3+15+15+9
|
|
|
|
Average |
9.75 |
=39/4 |
Adjustment Factor: |
.3 |
|
Adjusted Total |
11.7 |
=.3*39 |
Total days |
50.7 |
=39+11.7 |
|
|
|
Two people: |
25.4 days |
|
Total weeks: |
10.1 |
=50.7/5 |
Option 3: The "Hail Mary" Estimate
If you know only the database structure and you don't know how the data is used or how it flows, this approach will give you a rough estimate that at least reflects data complexity. The calculation assumes that sorting out data tables and their relationships represents the bulk of the work in a data warehousing project, without regard for source systems or how data is used.
Using this method, a project that has few source systems but thousands of tables will yield a similar estimate to one with many systems and few tables. Obviously, there are some potential problems with this assumption; nevertheless, it will give you an estimate that's closer to realistic than random.
First, count the number of fact tables and spreadsheets (including external feeds). Multiply the total by 3 (days to complete design, data quality/ETL, testing, and basic reporting). At the end, add one week per source for analysis.
Example:
Sources: |
4 |
|
Total tables: |
80 |
|
Total days: |
240 |
=80*3 |
|
|
|
Analysis |
20 |
=4*5 |
Total days: |
260 |
=240+20 |
|
|
|
Total weeks: |
51 |
=260/5
|
Try, But Don't Be a Hero
Our hypothetical scenario happens more often than you might think. Business executives are understandably attracted to the vision of data-driven decisions and global analytics capabilities. They want it perfect and yesterday, but have no real idea of what's involved in making it happen -- or even what's involved in figuring out what's involved. Thus, they schedule a meeting and ping the "data guy" for "a few quick numbers." This is what we in the biz call a high-risk scenario where failure seems imminent.
When you can't say "no" to an impractical request, you have a couple of choices: kill yourself trying or make a reasonable effort and address the gaps in another way. Consider, for example, our "Quick and Dirty" and "Hail Mary" estimates, which cover only technical execution. If this is all Syd wants, you can stop there.
However, it's possible the people evaluating the system are expecting a more programmatic estimate that factors in strategic efforts like requirements gathering, project initiation, planning, semantic harmonization, procurement, BI software, and user training. Thus, when you send Syd your answer, it's important to be clear about what it does (and doesn't) cover, how much confidence you have in its accuracy, and what you would need to increase that confidence level.
About the Author
Cass Brewer has more than 15 years of experience as a technology analyst and project manager, focusing on governance and risk management. She is the founder of Truth to Power research community and led research at the IT Compliance institute, amid other roles in business and software analysis, management, and communications. Contact her at [email protected] with your comments and to suggest topics for future articles.