The Petabyte Corporation

Fred Hapgood

Suppose you come into work one day and find a letter from corporate counsel advising you that a court has just held that corporations -- that means you -- are now responsible for retaining all phone conversations pertaining to business for a year. Under that is a memo from PR (labeled 'urgent') warning that marketing's plan to use face recognition to identify prime customers entering the store for special attention risks adverse press fallout. And under that is a directive from management asking for a technical analysis of a market simulator that could predict, to within 5 percent, how single parents living in suburbs with incomes in the third decile would respond to a proposed campaign.

Who are you? You're a petabyte CIO, the person responsible for developing and maintaining the science-fiction-like applications that will run on tomorrow's immense storage capacities. Currently petabyte responsibilities are mostly confined to the IT departments of research organizations -- telescopes, molecular genetics labs, particle accelerators. But the law of technological adoption (if you build it, they will come) predicts that sooner or later most CIOs will cross that line, discovering a new world of applications, responsibilities, and costs.

In the near term, petabyte levels of storage make possible three new categories of applications. The first depends on retaining and processing large amounts of visual data, especially data from videocams. The flow of traffic captured by cameras trained on the display floor might be processed to measure the effectiveness of floor displays. Marketing might be interested in learning how the proportion of couples entering the store changes during a campaign. Human resources might be able to flag personnel responsible for traffic bottlenecks.

A second category supports the transition to device networks. If the first generation of networks connected people to data and to each other, the second will do the same for physical devices (counters, meters, cameras, motors, switches, telephones, digital printers) and virtual ones (applications and program objects). The great virtue of device networks is that they give any interested constituency remote access to any link in the production cycle. CNN is digitizing and networking all of its production equipment so that its specialized format teams -- pagers, cell phones, PDAs, websites -- will have equal and simultaneous access to programming. In many industries, even today, machines keep maintenance informed about their operating condition, allowing them to be repaired just before they fail (call it 'Minority Report maintenance'). When production is fully networked, management will be able to switch an entire production process to a single desired configuration from half a world away. However, according to John Parkinson, CTO of Cap Gemini Ernst and Young in New York City, properly managing the thousands of sensor-actuator loops that form device networks usually requires retaining the history of their states, often in raw, i.e., unsummarized, form, for months. Petabyte storage is the natural infrastructure for such networks.
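Parkinson's point about retaining loop histories in raw form is easy to picture as an append-only log. The sketch below is a hypothetical illustration, not anything Cap Gemini Ernst and Young describes; the device name, file name, and fields are invented.

```python
import json, time

# Hypothetical sketch: retain a sensor-actuator loop's history in raw,
# unsummarized form. Every state change is appended as-is, so the full
# history can be replayed or re-analyzed months later.

def record_state(log_path, device_id, state):
    entry = {"ts": time.time(), "device": device_id, "state": state}
    with open(log_path, "a") as log:
        log.write(json.dumps(entry) + "\n")   # append raw; summarize later, if ever

record_state("loop_history.log", "valve-17", {"position": 0.42, "alarm": False})
```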

Finally, petabyte levels of storage would allow simulations and predictive models of enormous complexity. For instance, retail managers currently worry about persuading the casual visitor, someone who arrives at the site or store thinking he or she is "just looking", to make a purchase. While this is an important problem, another is turning a casual buyer into a loyal, recurring one. Over the long term, this second kind of conversion can contribute even more to revenues.

However, as Richard Winter of The Winter Corp., a Waltham, MA consultancy specializing in the architecture of very large databases, points out, brand loyalty does not take root overnight. "Guiding someone into becoming a repeat customer means presenting him or her with just the right information or opportunity at the right time," he says. "Knowing what to present can mean retaining huge amounts of information on that customer -- what they've looked at, checked prices on, asked about, what they've not looked at -- over long periods. Often the relationship needs to be followed from the point the customer first appeared. That is impossible now, since raw, unsummarized clickstream and transaction data is generally discarded after 30 to 60 days." Petabyte levels of storage will lift that constraint.

In other words, it is not hard to think of useful and interesting reasons why management might ask IT to take the company through the petabyte door. The next issue is finding a way to execute that request without getting fired. Even if you assume storage-related costs (especially the time penalties) scale only linearly, a petabyte multiplies the headaches associated with a terabyte of data roughly a thousandfold: the difference between an inch and an eight-story building.
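A quick back-of-the-envelope check makes the analogy concrete. This is illustrative arithmetic only; the byte counts use the standard binary definitions, and the story-height conversion is an assumption.

```python
# Back-of-the-envelope check of the inch-to-building analogy (illustrative only).
TERABYTE = 2**40          # bytes
PETABYTE = 2**50          # bytes

scale = PETABYTE / TERABYTE        # 1024x
inch_in_feet = 1 / 12
scaled_height_ft = inch_in_feet * scale   # about 85 feet

print(f"A petabyte is {scale:.0f}x a terabyte.")
print(f"An inch scaled by the same factor is about {scaled_height_ft:.0f} feet,")
print("roughly the height of an eight-story building.")
```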

This is depressing enough, but the news gets worse. Searches conducted on larger volumes of data naturally generate more errors. At some point these error volumes so overwhelm the user's ability to cope that he or she can no longer use the system at all. The only solution is to rewrite the search programs so they make fewer errors, and no IT development task is harder to do predictably than boosting the IQ of computer programs. Finally, according to CGEY's Parkinson, even the costs of the core overhead tasks (like buffer management) typically grow faster than linearly.
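One way to see why search errors scale with volume: even at a fixed per-record error rate, the absolute number of spurious matches grows with the size of the store. The rates and record counts below are invented for illustration.

```python
# Illustrative only: why the same search logic produces more errors on more data.
# Assume a fixed per-record false-positive rate; the numbers are invented.
false_positive_rate = 1e-6        # one bad match per million records scanned

for records in (10**9, 10**12):   # roughly terabyte- vs. petabyte-scale row counts
    print(f"{records:>16,} records -> ~{records * false_positive_rate:,.0f} spurious hits")
```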

One body of opinion believes the transition is just not worth it. Faisal Shah, CTO of systems integrator Knightsbridge Solutions LLC, points out that data quality naturally drifts down as more space opens up in the corporate attic, in part because you are now saving everything you used to throw away: meeting notes, drafts, video files, unsuccessful bids. Shah believes that companies will be better off spending the marginal IT dollar on trying to extract more intelligence from current data stores rather than piling up haystacks with fewer and fewer needles hidden in them.

Other observers are betting on new technologies to keep these penalties under control. Like many IT problems, the solutions being explored fall all along the spectrum from centralized to distributed. Ron Davis, Senior IT Architect of Equifax, the consumer data company in Atlanta, is working with a centralized solution from Corworks of Stamford, Conn. Equifax makes information products from raw data bought from suppliers such as state agencies or directory companies. The company prefers to control this data for as long as possible, since it never knows what a new product design might call for, or when. While in theory historical data could be left with the suppliers, Davis's experience is that retention policies and practices vary too widely across Equifax's 14,000 suppliers to make such dependence practical. He believes that, at least over the short run, companies near the end of the value chain will have to take on the responsibility of archiving raw data. Shouldering this responsibility has put Equifax on track to become a petabyte company, and forced Davis to search for a data architecture competent to deal with the problems mentioned above.

Corworks' basic idea is to beat the time penalties inherent in handling large volumes of data by loading it all into electronic memory. This seems counterintuitive, rather like making a quart easier to drink by squeezing it into a pint, but the feat is done by stripping out the structural data (i.e., converting everything into flat files), compressing the result, and then relying on fast processors to decompress and restore the data structures only as needed -- just-in-time logic. "I have a 67 billion row table," Davis said, "and I can do a sort across six months of that table in three seconds." Backing up and restoring become easier for the same reasons.
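A minimal sketch of that just-in-time idea, under loose assumptions: rows are flattened into a flat file, compressed in memory, and only decompressed and restored to structured form when a query arrives. The column names and data are hypothetical; this is not Corworks' actual implementation.

```python
import csv, io, zlib

# Hypothetical illustration of "just-in-time" decompression: keep data in memory
# as compressed flat text and rebuild the row structure only on demand.

rows = [
    {"customer_id": "C100", "visit_date": "2002-06-01", "amount": "47.10"},
    {"customer_id": "C101", "visit_date": "2002-06-02", "amount": "12.95"},
]

# Flatten the structured rows into a flat file (CSV) and compress the result.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["customer_id", "visit_date", "amount"])
writer.writeheader()
writer.writerows(rows)
compressed = zlib.compress(buf.getvalue().encode())   # lives in RAM, not on disk

def query(predicate):
    """Decompress and restore row structure only when a query arrives."""
    text = zlib.decompress(compressed).decode()
    restored = csv.DictReader(io.StringIO(text))
    return [row for row in restored if predicate(row)]

print(query(lambda r: float(r["amount"]) > 20))
```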

A second approach to leveraging the speed of electronic memory is to build algorithms that grade data by importance. The most critical pieces get loaded into memory, while the rest goes to disk-based systems where lower performance levels (and therefore lower operating costs) are tolerable. Dave Harley, Chief Designer of British Telecom, is experimenting with this approach using software from Princeton Softech. While to date the approach has only been used in asset management and fault tracking, results have been such that Harley expects to see this so-called "active archiving" adopted throughout the company. "The key factor is keeping the most critical database as small as possible," he says. "It's quite a new idea."

StorageNetworks, of Waltham, MA, is also using this approach to manage the 1.5 petabytes acquired through its storage services arm. Peter Bell, the company's CEO, says that 70% of the data stored on an average system has not been looked at in the previous 90 days. If you make the reasonable assumption that the number of recent accesses is a dependable proxy for enterprise criticality, then loading just the most-used data into memory can go a long way toward delivering acceptable performance where it is needed. Bell adds that the critical issue in managing petabyte-scale volumes of data is developing data classification systems that deliver this power without introducing excessive single-point-of-failure risks. (If a computer managing a petabyte goes bad, the damage it can cause is breathtaking.)
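Bell's 90-day observation suggests a simple recency-based grading rule of the kind Harley and Bell describe. The sketch below is hypothetical; the record layout, and the choice of 90 days as the hot window, are assumptions drawn from the figures quoted above.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of "active archiving": grade records by recency of access
# and keep only the hot tier in memory; the rest goes to cheaper disk storage.

HOT_WINDOW = timedelta(days=90)
now = datetime(2002, 9, 1)

records = [
    {"key": "order-1", "last_access": datetime(2002, 8, 30)},
    {"key": "order-2", "last_access": datetime(2002, 3, 14)},
    {"key": "order-3", "last_access": datetime(2002, 8, 1)},
]

hot  = [r for r in records if now - r["last_access"] <= HOT_WINDOW]
cold = [r for r in records if now - r["last_access"] > HOT_WINDOW]

print(f"{len(hot)} records stay in memory, {len(cold)} go to the disk tier")
```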

On the other hand, Len Cavers, Director of Technical Development at Britain's Experian (Nottingham, U.K.), a competitor of Equifax, believes that in the long run centralized solutions will not scale adequately. He argues that as backbone bandwidth increases and data standards get defined and distributed, companies like his will find it increasingly practical to "leave the data" higher and higher up the value chain. In that world, the networks would carry not raw data (which wouldn't move) but queries and intelligent indexes, so that querying systems know whom to connect to. Experian is now engaged in an active development program with its partners on how to use XML to frame and respond to queries and to generate indexes. Cavers believes that petabyte-level data stores will force IT people to minimize the number of mass copy operations. "This is a paradigm shift in the way people think about computing," he says.
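What Cavers describes can be sketched as a network that moves XML-framed queries and an index of who holds what, rather than the raw data itself. The element names, index contents, and supplier hosts below are invented for illustration; Experian's actual XML formats are not described in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch of the "leave the data in place" model: the network carries
# an XML-framed query plus an index telling the querying system whom to contact,
# rather than bulk copies of raw data.

# A toy index: which upstream supplier holds which data category.
index = {"credit_history": "supplier-a.example", "utility_bills": "supplier-b.example"}

def frame_query(category, subject_id):
    q = ET.Element("query")
    ET.SubElement(q, "category").text = category
    ET.SubElement(q, "subject").text = subject_id
    ET.SubElement(q, "route-to").text = index[category]   # resolved from the index
    return ET.tostring(q, encoding="unicode")

print(frame_query("credit_history", "subject-42"))
```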

Gerry Higgins, Senior VP of information processing at Verizon, points out that maintaining a petabyte of data raises distribution management issues in hardware as well as software. In the petabyte world data is usually spread over thousands, even tens of thousands, of disks. "Vendors always want to talk to me about how great their Mean-Time-Between-Failure numbers are. I tell them not to bother -- all I'm interested in is what happens when there is a failure. When you deal with so many disks, some are always crashing. I tell them that when you're a petabyte guy like me, you have to expect failures."
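Higgins' point is easy to quantify. The disk count and MTBF figure below are assumptions, not Verizon's numbers, but they show why failure handling matters more than failure rates at this scale.

```python
# Illustrative arithmetic only: why MTBF matters less than failure handling at scale.
# The disk count and MTBF figure below are assumptions, not from the article.
disks = 10_000
mtbf_hours = 500_000          # a typical vendor-quoted figure

failures_per_hour = disks / mtbf_hours
failures_per_week = failures_per_hour * 24 * 7
print(f"Expected failures per week across the farm: {failures_per_week:.1f}")
# Roughly three drives a week: with enough disks, "some are always crashing."
```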

Many observers think the transition into petabyte levels is going to introduce changes even more sweeping than those associated with previous leaps in storage. "Traditionally vendors have built stand-alone datamining engines and moved the data into them," says Richard Winter. "But are you going to be able to move a petabyte around like that?" Winter foresees radical changes in engine architecture, probably involving breakthroughs in the engineering of parallelization. (He cites the work of Ab Initio of Lexington, MA as illustrating the trend.)

"The whole notion of storage takes on a new meaning," says Scot Klimke CIO for Network Appliances, a storage services vendor in Sunnyvale. "It starts to be defined less as simple retention and more as the struggle for information quality." Perhaps the worst such issue is consistency. A petabyte of data is so big, and the quality of the information it contains is so low, that it is bound to contain and create inconsistent information, which means that any petabyte-level system has to contain ways of detecting and resolving data conflicts. Another issue is aging: information quality varies (roughly) with age, but present systems track the age of material, especially material within a file, poorly. "I have 5 priorities for this fiscal year," Klimke says. "Two involve data quality. Both of those projects warrant executive steering committee oversight."

Klimke argues that as the petabyte revolution unfolds, the struggle to measure and manage data quality will increasingly define the job of the CIO. Whether or not he is right about that specific point, it seems likely that anyone immigrating into the petabyte world is going to have to learn a lot of new habits.