Storage Monkeys Blogs

Rants and Raves from the community
Tim

There is no doubt that the various deduplication technologies address a specific set of problems. For the purposes of this blog post, I'm going to focus on backup deduplication.

When most companies look at backup deduplication, they are typically trying to solve one or more of the following problems:

  1. Backup window - data growth is causing the backup process to bleed past the backup window and into production
  2. Retention on disk - either retaining some portion of backups on disk or growing the number of days data is being retained on disk without ridiculously large amounts of disk storage
  3. Replication - reduce the amount of data that needs to be replicated to a remote site thereby lowering the cost of bandwidth

IT departments looking for one of these solutions have identified a problem with their backup processes and are looking for a way to fix it. Simple enough... but are these problems REALLY the issue or are they symptoms of a larger problem with traditional backup?

Let's think about it - how much have IT departments changed in the last 20 years? Pretty dramatically when you think about it. One thing that has not changed is backup. The same backup software you are using today is not all that different from the backup software used 20 years ago. Functionally, you are taking daily, weekly, monthly and maybe yearly backups of your data. The process hasn't changed and for the most part, backup administrators haven't evolved much either. Ask yourself or your backup administrator why they are using backup software and the likely answer is "Well, that's what we've always done and it works just fine". If it's working for you, then you shouldn't need a deduplication solution, right?

The truth is that backup software is very outdated technology. We use it because we've always used it, not because it's the best or most effective way to protect our data - and I'm just as guilty of this as anyone else. It's the easy way but it's not the best way. Many of the leading storage and backup companies offer alternative strategies (e.g. snapshots) that are more effective and more efficient at data protection - and do it at a lower cost. Each of the problems that backup deduplication solves is a problem with the technology, and less sophisticated IT departments opt not to solve the technology problem; they simply patch it.

Deduplication is a patch, not a solution.

So let me throw this question out - why are you using traditional backup software? Give me a technical reason, not a process reason.

 


Comments (31)
mk408
I can't give such a reason
written by Max Kalashnikov, July 02, 2009
For at least 5 years, I've argued pretty strongly against backup software for much the same reason I argue against tape backup.

For longer than that, I've perceived there to be 3 distinct types of data protection, to which I give arbitrary names:

- archiving
- versioning
- performance

Actually, I think archiving is, perhaps, the least arbitrary, as there seems to be a sub-group of the storage industry which self-identifies this way. However, I'm not sure if my personal definition aligns with theirs. Archiving is also, perhaps, the least truly desirable of the three. It's truly long-term (OOM of decades) storage of data, which one never expects to use. The protection here is against degradation over time.

Versioning is probably where most traditional "backups" lie, in that the point is to be able to obtain data from a certain point in the past, often immutably and sometimes with hierarchical versions. Snapshots fall into this category, as do versioning filesystems, and SCM systems. These protect against human error or malice.

Performance (which I choose over "availability" or "access" for its avoidance of collision of initial letter with "archiving") is where I put technologies like RAID and replication. These protect against physical and service (from the PoV of data access) failures.

I have explicitly avoided mentioning disaster recovery, since that could be any or all of the three, depending on how one defines the disaster and the recovery.

Traditional backup software seems a bit of a kludgy straddling of archiving and versioning, without doing either one especially well.
ironmonk
Then what is...
written by Max Brackett, July 02, 2009
What is the most cost effective and efficient way to…
• grandfather backups (7 days, 4 weeks, 12 months)
• keep a backup offsite
• quickly access the backup
…without using deduplication such as DataDomain/ExaGrid/etc…?
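For reference, the grandfather rotation in the list above (7 days, 4 weeks, 12 months) boils down to a pruning rule that applies equally to backups or snapshots. A minimal sketch, assuming one copy is taken per day, that the weekly copy is the Sunday one and the monthly copy is the one taken on the 1st; the function name and those conventions are illustrative and not taken from any product.

```python
from datetime import date, timedelta

def gfs_keep(copies, today, dailies=7, weeklies=4, monthlies=12):
    """Return the set of copy dates to retain under a 7/4/12 rotation."""
    ordered = sorted((d for d in copies if d <= today), reverse=True)
    keep = set(ordered[:dailies])                        # most recent dailies
    sundays = [d for d in ordered if d.weekday() == 6]   # assumption: weekly = Sunday
    keep.update(sundays[:weeklies])
    firsts = [d for d in ordered if d.day == 1]          # assumption: monthly = the 1st
    keep.update(firsts[:monthlies])
    return keep

today = date(2009, 7, 2)
copies = [today - timedelta(days=n) for n in range(400)]
kept = gfs_keep(copies, today)
print(len(kept), "of", len(copies), "copies retained")
```

Everything not in the returned set can be expired. A copy that satisfies two rules at once (for example a Sunday within the last seven days) is only kept once, so the retained count is at most 7 + 4 + 12.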
mk408
...
written by Max Kalashnikov, July 02, 2009
Copy-on-write snapshots.
Tim
...
written by Tim Masters, July 02, 2009
mk408 is exactly right - it's much more efficient on storage, on replication and it provides much more granular restores depending on how frequently you snapshot.
StorageGrrl
...
written by Storage Girl, July 06, 2009
From a logic perspective, I can't think of a good reason to keep using backup software either. The problem is still the mountain of challenges (technical and political) when you try to change anything in a large environment. We still have a few NT systems in production which is absurd in 2008.
ironmonk
...
written by Max Brackett, July 06, 2009
Can copy-on-write snapshots restore individual files? What software do you suggest for COW snapshots?
Tim
...
written by Tim Masters, July 06, 2009
COW can restore individual files... and at much more granular versioning than backup software.

As far as vendors go... I'm not going to recommend any since I need to be diplomatic here... but I'm sure others will chime in with options
wcpreston
My thoughts
written by W. Curtis Preston, July 06, 2009
The same thing was said of every major advancement in backup and recovery in the last 20 years.

My first thought is that if (1) everyone went to near-CDP-ready storage and (2) the near-CDP-ready storage fully understood all the apps that needed to be talked to, then you wouldn't need "backup" as we know it today. I would still want to use it as a second line of defense behind all those replicated snapshots, but it would no longer be a primary defense.

FWIW, many people HAVE done the above by using NetApp snapshot technology. NetApp's one of the few vendors that allow customers to have hundreds of snapshots and weeks and weeks of history without a performance penalty. (Copy on write snapshots can't do that, BTW.)

BUT...

Many people don't want to swap out the infrastructure just to get good backups. (So #1 above is out.)

AND

There are plenty of applications that AREN'T handled well by today's products. (So #2 is out as well.)

SO

We're back at backups as the catch-all.

I completely disagree that backups today aren't much different than backups from 20 years ago. I did backups 20 years ago so I should know! There have been tons of advancements in backups since I joined the industry. The first big one I remember is database agents, then LAN-free backups, BCVs. Then there are snapshots (AKA near-CDP), CDP, source deduplication -- all of which are completely different than previous methods.

The problem is people won't change. I've spent a career trying to get people to change what they're doing and they almost always don't change much -- especially if you're trying to get them to do a wholesale change of their backup system.

So until that happens we need the finger in the dike.
mk408
...
written by Max Kalashnikov, July 06, 2009
I'm not sure that a CoW product needs to "understand" anything but block-level I/O. Moreover, although, obviously, there can be a performance hit for CoW, it does not follow that there must be. Do you have numbers?

CDP seems to me to be solving (if it is at all) a slightly different problem than CoW snapshots. To call the latter near-CDP belies two of the major features of such snapshots, their atomicity, and the ability to manage their resource utilization separately. This is remarkably similar to traditional backups, hence my (and Tim's) suggestion that they are the modern substitute, without a huge paradigm shift, obviating the need for de-dupe.
wcpreston
...
written by W. Curtis Preston, July 07, 2009
Even the successful true CDP products have realized that they need some application-level awareness to be fully successful. While you CAN do CDP without any application-level awareness, and you can recover to any point in time, what vendors with some success in the space found out was that it's nice to have a stable, known point in time to go back to where you know what the status of the application was. An example of this is to go back to the last time that Oracle was put into backup mode. In fact, I would suggest that the level of success of a given CDP product is directly related to the degree of application awareness that they have. In fact, the most successful CDP products have been application-specific CDP products.

I have done a lot of testing with COW-based systems, and I can tell you that every single one of them I have tested experienced a SIGNIFICANT performance penalty (I'm talking as much as a 90% drop) if you take hundreds of snapshots and keep them for something like 90 days. I've done enough testing to believe that the problem is the concept of COW. It works fine for a few snapshots that are then used as a source for batch-based backup, but not as a replacement, as that would need 90 days of history or more. NetApp is not COW and does not share this problem. I'm not saying NetApp is perfect, but they're the best I've seen in this area. There are some up and comers that are doing similar things to copy-on-write, like redirect-on-write, and I haven't tested them so I can't comment on how their performance would be.

As to snapshot-based backup needing application awareness, it's an absolute requirement. Consider an Oracle database that's sitting on a snapshot-ready filesystem. Anyone that is creating snapshots of that Oracle database without putting it in backup mode first needs to rethink their strategy. It wouldn't be supported by Oracle or the snapshot vendor, nor would it be recommended by any backup expert that I know. So... My general statement is that a snapshot-based backup setup needs to be aware of the application it is backing up.

The reason that snapshots and replication are called near-CDP (by many more than just me) is that they share a lot of similarities. They are both block-level incremental forever (essentially replication), both can present to an application an instant read-write mount point that it can use immediately while the "real" storage is being repaired and restored, and both can "restore" to a previous point in time by simply moving some blocks around (i.e. NOT doing a full restore to go back 20 mins). The only difference between the two from a restore perspective is that CDP can restore to any point in time (literally to the last I/O write before something bad happened) and snapshot-based systems can restore to "near" that point (usually up to about an hour, based on when you last took a snapshot). So we call them "true CDP" and "near-CDP," respectively.

I don't see how anything I described in the previous paragraph has anything in common with batch backup. Near-CDP backups don't need dedupe, so I'm not sure why you said they do. Block-level delta snapshots essentially are dedupe, as they eliminate all the duplicated data found in traditional batch backup.
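To illustrate the point that block-level delta snapshots hold only what changed, here is a toy copy-on-write volume. It is a sketch under simplifying assumptions (an in-memory dict of blocks, hypothetical class and method names), not a description of how any vendor implements snapshots, and it says nothing about performance.

```python
class CowVolume:
    """Toy block device with copy-on-write snapshots (illustrative only)."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)   # live data: block number -> bytes
        self.snapshots = []          # one delta per snapshot: {block number: old bytes}

    def snapshot(self):
        # Taking a snapshot copies nothing; it just opens an empty delta.
        self.snapshots.append({})
        return len(self.snapshots) - 1

    def write(self, block, data):
        # First write to a block since the newest snapshot: preserve the old
        # contents in that snapshot's delta before overwriting.
        if self.snapshots and block not in self.snapshots[-1]:
            self.snapshots[-1][block] = self.blocks.get(block)
        self.blocks[block] = data

    def read_snapshot(self, snap_id, block):
        # The value at snapshot time is the first preserved copy found in this
        # snapshot or a later one; if none exists, the live block is unchanged.
        for delta in self.snapshots[snap_id:]:
            if block in delta:
                return delta[block]
        return self.blocks.get(block)

vol = CowVolume({0: b"boot", 1: b"mon", 2: b"data"})
s0 = vol.snapshot()
vol.write(1, b"tue")
s1 = vol.snapshot()
vol.write(1, b"wed")
vol.write(2, b"DATA")
print(vol.read_snapshot(s0, 1), vol.read_snapshot(s1, 1), vol.blocks[1])
print([len(delta) for delta in vol.snapshots])
```

Block 0 is never written, so it is never copied: each snapshot costs only the blocks overwritten after it was taken, which is the sense in which the snapshot history carries no duplicate data.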
Tim
...
written by Tim Masters, July 07, 2009
Curtis:

It sounds like you believe CDP (or "near-cdp") solutions are viable options if you have application-aware agents and are prepared to change your infrastructure. I would agree with this.

As far as penalty goes... it really does depend on how frequently you snapshot. For most companies, one or two snapshots per day will suffice - but this opens up a discussion on snapshot and retention policies which I don't think enough vendors have developed best practices for since they want to be all things to all people.
meelo
...
written by Michael Mendez, July 07, 2009
I'd like to understand the scenario where CoW is not as good an option as backup software
wcpreston
You got it
written by W. Curtis Preston, July 07, 2009
Not only do I think they are viable alternatives, I think they are the best options today. But the backup world moves very slow.

I think that it is ultimately how "backups" will be done in the future, but it is going to take at least 10 years to catch on. Meanwhile, we keep applying band-aids like disk-backup and dedupe, cause that's what the people want. Little bits of change at a time.

It's not the frequency of snapshots but the number of them you keep that's the problem. With NetApp the answer is 256 as long as you want with no problem. With everyone else I've tested, divide that number by 10 at least if you want no performance degradation.
mk408
@wcpreston
written by Max Kalashnikov, July 07, 2009
I think you misunderstood me, in that I believe we agree that CoW snapshots are, themselves, a form of deduplication.

We also agree that NetApp snapshots, though mimicking the functionality of CoW, are implemented in a more clever fashion.

We further agree that anything batch-based, be it traditional backup or snapshots, needs to be application-aware overall. However, I think we disagree where that awareness must lie. I'm suggesting that it need only be at the level of the human administrator, not built into the technology.

Where we clearly disagree is in point of view on PIT recovery. I don't consider arbitrary PIT (a.k.a. CDP) to be a goal, but, rather, a method/solution. I consider the goal to be before-point-in-time recovery. That is, the business case is to recover to a point before a known-bad state.

To this end, both snapshots and traditional backups provide a similar, in my mind, route: they are both batch or single point in time based. They can also both be copied, deleted, archived, or otherwise administratively handled. This seems impossible with CDP.

Where we may not disagree, though where I focus my skepticism, is performance. However, a 90% decrease is consistent with my own observations of typical sequential versus contentious loads. Did the solutions you looked at use the same spindles for the main data as for the snapshot storage? If so, the performance drop is likely due to the particular implementation and not anything fundamental to CoW.
DavidB
FalconStor CDP
written by David Bowers, July 07, 2009
We've been using FalconStor CDP (which I think Curtis would classify as "near cdp") with Oracle snapshot agents and it has worked flawlessly with Oracle. We take four snapshots per day and barely notice any performance degradation. Eventually we will be adding Exchange agents to protect our email system, which we are now using CommVault to back up.
wcpreston
I think we're close
written by W. Curtis Preston, July 07, 2009
@mk408

I think you misunderstand me! ;)

I don't care where the application awareness comes from. I just think it needs to exist. I also think that many admins will need it to be in the app or they won't have any application awareness. There are MANY people that are not comfortable with writing scripts to make something happen.

As to the whole goal/means discussion, I don't think we disagree there either. The goal is to minimize business interruption (downtime) and lost transactions (work which must be repeated or is lost for good). CDP does well at both. Near-CDP does well at the first, and much better than batch backup on the second.

As to snapshots and backup being similar, I couldn't disagree more. One creates duplicated data (full backups and full file incrementals) and needs dedupe; the other eliminates duplicate data (only doing delta-level transfers once a day/hour/minute). And here's the big one: backup requires a restore; near-CDP (snapshots) do NOT. You just mount the volume and you're off and running. The only way they are the same is that they are done on a scheduled basis. By that comparison, filling up my gas tank is the same as backup.

Yes, the implementations I looked at used separate volumes for snapshot data and were very proud of that. Ask them to take a snapshot hourly, keep those for a week, and keep one daily for 90 days. Watch them run, or watch them (as in the case of a certain large vendor and another large customer I witnessed) tell you that your requirement is stupid. BTW, if you've done what I'm describing on COW storage, I'd love to talk more about what you've done offline.
wcpreston
...
written by W. Curtis Preston, July 07, 2009
@DavidB

The problem is not the taking of the snapshots. It's the keeping of the snapshots. How many snapshots are you keeping on the primary storage? For example, some people keep only one snapshot on the primary storage, and use their replicated target to hold history. Others may take four a day, but keep one or two snapshots at a time and use one of them per day as a source for their backups.

The problem is when you try to replace backups entirely by keeping a long history of snapshots on both your primary and replicated copy. That's when you will see a performance degradation with many solutions.

Can you tell me a little more about what you're doing?
mk408
...
written by Max Kalashnikov, July 07, 2009
I'll start calling CDP "excessive-snapshots", since I still don't consider CoW snapshots to be "near" anything, but a distinct tool.

I'm an admin (and a technically minded one at that), so I'm naturally biased against giving over control to software which purports to be aware of something external to itself.

I still disagree that snapshots and traditional backups are only the same with regard to scheduling. In fact, neither has that as a characteristic. Rather, they have the characteristic that they can be scheduled. They also share the characteristic of atomicity, which you have not addressed. A particular snapshot, just as a particular backup, can be kept, deleted, or replicated. Excessive-snapshotting takes away the scheduling and management options and therefore a substantial degree of control. The two also share a commonality in the underlying full vs incremental option. A mirror snapshot would correspond to a full backup, with copying, whereas a CoW snapshot corresponds to an incremental backup.

The requirement you describe isn't stupid, merely large, which seems appropriate for a user who could be similarly described. The total number of snapshots is 251, though likely the biggest challenge is the 168 hourly snaps. I have not, in fact, done anything quite that extensive. The closest would be 4-hour snaps for a week with vxfs "Storage Checkpoints." The trick was ensuring that there was enough spindle diversity, since, unlike what they call "snapshots," these don't explicitly use a separate volume. I believe that, now, vxfs supports multiple volumes and complex allocation policies, so it may be easy.
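The 251 figure follows from the schedule described earlier in the thread (hourly snapshots kept for a week, one daily kept for 90 days) if the daily snapshot is simply one of the hourlies, so the first week's dailies are not counted twice. That overlap is an assumption about how the number was reached; the arithmetic below just checks it.

```python
HOURS_PER_DAY = 24
hourly_days = 7    # hourly snapshots are kept for one week
daily_days = 90    # one snapshot per day is kept for 90 days

hourlies = HOURS_PER_DAY * hourly_days      # snapshots covering the first week
# Assumption: within the first week the daily snapshot is one of the hourlies.
extra_dailies = daily_days - hourly_days    # dailies older than one week
total = hourlies + extra_dailies
print(hourlies, extra_dailies, total)       # 168 83 251
```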
wcpreston
Back to the point
written by W. Curtis Preston, July 08, 2009
Before I get back to the point, let me address comments in your post.

First, I want to say PLEASE don't refer to true CDP as "excessive snapshots." The official definition of CDP specifically precludes the use of snapshots to deliver it. It does not use snapshots in any way. It is complete journaling of every single write -- continuously -- or it's not CDP. Even if a snapshot is taken every second, that's still not CDP. As I've told the near-CDP vendors, even a second is a period of time, and "periodic" is an antonym of "continuous."

Second, I think I see where we're arguing (maybe). When I say snapshots, I do NOT include full-volume copies (i.e. split mirrors, BCVs, etc.). Those are not snapshots, they are copies. Only Veritas refers to a full-volume copy as a "snapshot," and it's confused the issue for anyone that's read their documentation. So when I use the term snapshot, I am referring to a virtual copy of a volume, not a full one.

So back to the point. The article said that "One thing that has not changed is backup. The same backup software you are using today is not all that different from the backup software used 20 years ago." And that just isn't true -- not even close. Near-CDP (or snapshots and replication if you prefer) represents a significantly better way to do backups than the full/incremental backup system that was state-of-the-art 20 years ago and common in most backup systems today.

But not all near-CDP systems are created equal. And unfortunately, I would argue that the requirements I specified are quite normal if what we're talking about is a backup system. 90 day retention of daily backups is a very normal requirement for a backup system. So -- back to the point -- the question was asked "why are we still using backup software?" My answer is that the alternatives are still not quite ready for many people's requirements, whether it's application awareness or the retention of enough data to meet typical operational recovery requirements -- many alternative systems are still not quite there. (I think some of them ARE there, but I'm trying to answer the question posed in the original post.)

Let me defend the "near-CDP" term one more time, as it's a pretty common one. Now that you realize I'm not talking about full-volume copies as snapshots it may help. Both CDP and near-CDP incrementally transfer changed BLOCKS to a target system throughout the day. True CDP transfers this data immediately and continuously. Some near-CDP systems transfer changes immediately and continuously, others transfer changed blocks when you tell them to, but still do so incrementally, transferring only those blocks that have changed since the last time you told them to do so. So they're very close on the backup side. A true CDP system can recover to any point in time, and the near-CDP system can recover NEAR to any point in time. (Much nearer than the typical 24-hour batch backup system can.) This is why people call it near-CDP.
mk408
...
written by Max Kalashnikov, July 08, 2009
How about "over-CoW"? :)

Thinking about what term I'd use for what you're calling CDP, I realized we're talking from different viewpoints. You're talking about the end goal (what), and I'm talking about the method/technology (how). I don't have a personal "what," so it's not CDP, meaning I don't consider any particular thing near-CDP. Similarly, I don't consider traditional backups to be near-snapshots. Even suggesting that the term should be read as "Nearcontinuous Data Protection" is a stretch for me, since, as you point out, there's no time quantum for continuity.

Continuous journaling (aka logging) sounds very much like a log-structured filesystem. Adding a log-structured component to a block-structured filesystem sounds quite a bit like NetApp's WAFL, not at all coincidentally, I'm sure. My over-CoW suggestion, though facetious, does, again, bring up the question of administrative control: is there any?

Call me old-fashioned, but, since Veritas was the first vendor I came across, 15 years ago, to offer a split-mirror feature, it's hard to abandon their terminology. I do agree that such a thing is a copy, but it's not just a copy. It's guaranteed to be a consistent/atomic, block-level copy (or snapshot, if you will) of the device at a point in time.

But, yes, back to the original question, I think there's a prerequisite question that must be asked. What constitutes "traditional backup software?" I've been going on the assumption that it is what decides when, (perhaps most importantly) what, and how (or where) to back up. This assumption is, obviously, wrong, if we're to include separate pieces of code, such as the underlying filesystem. Tim?
alexsons
How wrong you all can be?
written by Alex Sons, July 08, 2009
Wow. What misperceptions!

"Why are you using traditional backup software? (technical reasons only)"

OK. First things first. I'm an expert (I think) in IBM's TSM backup/archiving software, not other products, so I'll answer it from a TSM perspective only, although most arguments would count equally for other major enterprise products.

1. Snapshots are a pain in the ass, especially for long term retention!
As you should all know, each problem gets its own solution. Snapshots really were not meant for dealing with backup problems but meant for speed of restore. As most restores are about yesterday's data, snapshots dealt with the inadequacy of backup software in delivering backup data at the speed levels required by today's organizations. CDP technology takes this one step further by restoring data that was just created and has never been backed up...

Snapshots are meant to be created on disk and kept on disk. Even if you could and(!) would move older snapshots to tape, consider the following. Say there is an EDP auditor onsite and he requests last year's data from a specific application in order to compare it with what is stored today. Would you want to deal with recovering from a year's worth of snapshots? And what if he also needs two-year-old data? The same goes for requests for old data because of legal issues, which are not so uncommon anymore...

So in the end, you make backup/archive copies of your data also for long term purposes, with backup software! Snapshots and CDP really are technology mismatches.

2. Enterprise Backup Software shines in media management
Although TSM likely is the king here, media management is very important in dealing with loads of backup/archive data. The likelihood of needing backup data normally diminishes over time, so it becomes viable to move older backup data to tape. Some applications require lots of small files to be archived and delivered very quickly on request. In such organizations backup software like TSM can be used as an archive manager for files stored on WORM/UDO media, like Plasmon's libraries.

3. Human errors or worse?
It's very easy to have human errors delete complete RAID-sets. Sometimes you don't even need human errors, just plain hardware faults can do the same for you. It's almost impossible to completely wipe out both the tape library and offsite volumes. Even if you have an angry administrator trying to wipe out all online and backed up data, it is not hard at all to prevent administrators from accessing offsite volumes.

4. Data Dedup?
Data deduplication is the answer to another problem. If you do want to store backup data on disk you quickly end up with loads of disks and loads of disk-related costs. Most of the time, however, you back up (almost) the same data. Dedup is a smart technology which greatly reduces the disk capacity needed for backup storage. It does not lessen the need for backup software which manages the stored backup data, whether it is stored on old-fashioned tape or fancy dedup'd disk.

I hope this helps.
wcpreston
We'll agree to disagree
written by W. Curtis Preston, July 08, 2009
I and many others are fine with the term "near-continuous data protection." You're not. I tried to change your mind. I give up. ;)

CDP is not just a journaling filesystem, because the idea behind a journaled filesystem is mainly about maintaining integrity, not the ability to go back in time, per se. A journaling filesystem should help unravel whatever happened when your server had its power switch flipped on you. This is the same as the role of transaction logs in a database; they are used to roll back transactions that were in progress when something bad happens. Neither of these technologies, though, would roll you back to a point in time before that. A CDP system keeps track of every single write in the order they happened.

Assume that your original is not damaged, and you told the CDP product to "put me back to 5 minutes ago just before that idiot dropped a table." It knows which blocks have changed in the last five minutes and can just put them back to the state that they were before that. If the original was damaged, then it can be used as a standby copy while you're fixing the original, and it can restore the original if you had to completely reinitialize it.
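A toy model of the journaling described above may help: every write is logged in order along with the block's previous contents, so going back to an arbitrary time means undoing the logged writes newer than that time, newest first. This is a sketch with hypothetical names and integer timestamps assumed to arrive in non-decreasing order; it does not describe any particular product.

```python
class CdpJournal:
    """Toy continuous-data-protection journal (illustrative only)."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.journal = []   # (timestamp, block, previous contents), in write order

    def write(self, timestamp, block, data):
        # Log the old contents before applying every single write.
        self.journal.append((timestamp, block, self.blocks.get(block)))
        self.blocks[block] = data

    def roll_back_to(self, timestamp):
        # Undo, newest first, every write made after `timestamp`.
        while self.journal and self.journal[-1][0] > timestamp:
            _, block, previous = self.journal.pop()
            if previous is None:
                self.blocks.pop(block, None)   # block did not exist before
            else:
                self.blocks[block] = previous

db = CdpJournal({"orders": b"v1"})
db.write(100, "orders", b"v2")
db.write(200, "orders", b"v3")
db.write(300, "orders", b"dropped")
db.roll_back_to(250)    # any instant works, not only a snapshot time
print(db.blocks["orders"])
```

Only the blocks touched after the chosen instant are rewritten, which matches the "just put them back" behaviour described above. A snapshot-based system offers the same kind of rollback but only to the instants at which snapshots were taken.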

This is similar to what NetApp can do with WAFL, but again, to a completely different level. Yes, snapmirror can be used to incrementally restore a volume to a previous point in time, but that point in time must be a snapshot. With CDP, it can be any point in time.

As to your last question, Tim didn't use the term "traditional backup software," you did. What he said was that what we have today is "not all that different from the backup software used 20 years ago." 20 years ago commercial backup software was in its infancy. The product that would eventually become NetBackup was being used only by Control Data at the time. (They launched the AWBUS business unit in 1990 and launched BackupPlus in 1993.) Legato was formed in 1988 and I don't think they were shipping yet. Cheyenne's NetBack had just been introduced and would soon be replaced by Arcserve. TSM wouldn't come out for 4 more years. Maynard Electronics had just released MaynStream. They would be acquired by Archive Corporation, who would come out with Backup Exec in the early 90s. I believe Alexandria had its birth in the early 90s, but can't find any reference about that. Hardly anyone was using commercial backup software in open systems 20 years ago. (What I WILL say is that what they did for Mainframe backup has remained mostly unchanged for 20 years. All the cool stuff we're talking about is only happening on open systems.)

20 years ago, the bulk of backup software in open systems was dump/tar/cpio to stand-alone tape. That's what I did. And to say that today's backup and recovery systems (when you include things like CDP, near-CDP, source dedupe software) is even remotely similar to what we did 20 years ago is simply not true. And that was my point.

Maybe Tim's original point is that what MOST people are doing is fundamentally the same as what they did 20 years ago. I will agree with that statement, but I also believe that this is because (for the most part) the industry hasn't offered them good enough options to switch.
wcpreston
Wrong?
written by W. Curtis Preston, July 08, 2009
@alexsons

I don't see how what you said makes anything that we've been saying "wrong."

I said that snapshots could be a replacement for backup if they were better than they are today. You said they're a pain in the ass. While I wouldn't go that far, we're essentially saying the same thing. I would also say that I don't think that all snapshot products are a PITA, that some are actually quite nice. (I'd put NetApp WAY out front here.)

You talked a lot about long term retention and discovery requests. Those are both the job of an archive app, not a backup app, and I am sticking to my guns on that. If one is keeping backup data for more than 18 months, I think one should re-examine what one is doing.
alexsons
Archiving versus backup?
written by Alex Sons, July 08, 2009
@wcpreston (and others?)
I really agree: one should use archiving more often than one does today. There is a problem however, i.e. when to decide data may be archived and when the data is still needed. Of course, email archiving is easy and commonplace, as received email will not be changed, only replied to. Email is often referred to as semi-structured data.

For unstructured data however it is almost impossible to know when it is viable for archiving (and deleting from the filesystem). So that data will stay upon your systems forever. Although tiering storage might help in moving unused data to lower-cost media, it will still be online, and maybe someday edited.

Thus long term storage of backup data is still needed. If an organization goes to great lengths to implement ILM techniques to structure as much data as possible, it'll be diminishing the need for old-fashioned backup software, and dedup'd backup storage on disk or snapshotting becomes more viable as a complete solution.

In the end I think most organisations cannot without competent old-fashioned backup software. In my practice all those new techniques are most suitable as an addon to, not as a replacement for that old-fashioned backup software.
wcpreston
Use archiving software
written by W. Curtis Preston, July 08, 2009
There are filesystem archiving products just like there are email archiving products. If you have files that should be kept long term, then you should use a filesystem archiving product to retrieve them. Backup software is NOT designed to help you find a file that you haven't seen for three years, but archiving software is.

As to the part of your discussion where you're talking about space reclamation, that's an easy one. A good archiving product can easily archive and then delete data for you. No worries.

But I still agree with your last paragraph. I just don't want them using it for archiving. ;)
Tomstr
...
written by T-SM Black, July 08, 2009
Good discussion, but I think we have wandered from Tim's query: Is deduplication a strategy or a finger in the dike?
I propose that the answer is neither. Deduplication is a compression technology. It can take big piles of bits and make them smaller and fewer. Sometimes very successfully. Made viable by the increased processing power now cheaply available, it can be used in a variety of places in the data protection path to great advantage. It may enable, along with other compression technologies, longer usage of existing resources. One could claim that means "finger in the dike", but so is any technology or process improvement that doesn't require replacing the entire data center.

I believe that what Tim is really asking, down in the meat of his blog, is "Do traditional batch backup methods and applications still have a role in a modern data center?" I say "yes"; however, there are many places where better methods are available, as have been discussed above: mirroring and its close brethren BCV, CoW and CDP (logging), snapshots (usually a variant of delta differencing), etc. All are nice, more or less realtime data protection methods. However, there is still a role for batch backups, even to tape (gasp)! It's especially useful when media changes are necessary, especially for offsiting, long term archiving, relocations, disaster preparation, and such (sneakernet can still be faster and cheaper than ethernet - and you get to keep a copy). But for minimizing downtime, yes, there are much nicer things now.
alexsons
Is deduplication a strategy or a finger in the dike?
written by Alex Sons, July 09, 2009
Well, the first time I wrote about Tim's final question. This time I'll write about "the finger in the dike".
.
Backup with dedup was originally meant as a solution to the problem that one would like to store backup data on disk, but needed too many disks for that purpose. That translated into huge costs, of course. Somewhere a smart guy thought of this solution and voilà, deduplication was born.
.
A simplification, of course; nowadays there are a lot of dedup solutions. And there are a lot of different methods to choose from: gateway vs appliance, inline vs post-processing, etc.
.
Good reading material:
http://www.snia.org/education/tutorials/2009/spring/data-management/DanielBudiansky_Understanding_Data_Deduplication.pdf
.
Philosophically speaking: year-over-year online (as in "used by all production and test environments") data growth rates are about 100%. I rarely encounter a lower growth rate; sometimes I come across a 150% growth rate or even more.
.
Dedup vendors promise you dedup ratios anywhere from 1:20 to 1:50 or more. I believe that in real world usage most dedup ratios are somewhere between 1:3 and 1:10. For example, TSM uses the incremental backup method for file backups, thus storing only changed files, and will show a lower dedup ratio. If you use NetBackup with daily full backups, dedup ratios will be a lot higher.
.
In the end I assume everyone agrees that when storing backup data on disk, dedup will help you reduce the disk capacity needed. That is not the same, however, as lowering your costs... maybe something for another blog to discuss.
.
As said above, disk usage growth rates are about 100% a year. Let's assume this rate is not growing anymore and stays at 100% for the years to come. This will mean that if today your dedup solution needs 5 TB net disk capacity, you will need 40 TB net disk capacity in three years! This is fairly simple math. At the end of year one you need double (100% growth) the capacity, at the end of year two you would need four times and at the end of year three you would need eight(!) times today's net disk capacity. Dedup'd, that is.
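The doubling arithmetic above can be checked with a few lines of Python. This is only a sketch of the compound-growth calculation: the function name `projected_capacity_tb` is made up for illustration, and the 5 TB starting point and 100% annual growth are the figures assumed in the comment, not measured data.

```python
def projected_capacity_tb(current_tb, annual_growth, years):
    """Net capacity needed after `years` of compound growth.

    annual_growth is a fraction: 1.0 means 100% growth per year (doubling).
    """
    return current_tb * (1 + annual_growth) ** years


if __name__ == "__main__":
    # Assumed figures from the comment: 5 TB today, 100% growth per year.
    for year in range(4):
        print(f"end of year {year}: {projected_capacity_tb(5, 1.0, year):.0f} TB")
```

With per-TB licensing, the license cost would follow the same curve, which is the point being made about pricing.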
.
A 5 TB dedup solution is very common today. Lots of organizations need much more capacity, ouch!
.
So yes, with current per-TB licensing of dedup solutions, you may definitely speak of a "finger in the dike" solution. I therefore strongly recommend looking at any suitable solution for which you do not need to pay per TB of storage. I have not found any so far, so if anyone here knows of any vendor which uses a different pricing scheme, I'd gladly see your posting!
josephmartins
So many thoughts, so little time.
written by joseph martins, July 09, 2009
First, in response to the title of Tim's post, and [as he requested] in the context of backups alone, the concept of deduplication is an extremely important consideration for any data protection strategy. We can all agree that we'd like to store as little as possible, preferably in the least amount of space, and still meet or beat our day-to-day operational requirements. Deduplication is all about keeping the physical amount of stored data to a minimum. And, faced with a future filled with mind-boggling amounts of new data, deduplication is a good thing.

What is important to understand is that deduplication (which appears to be a term born in the storage industry in the past decade) goes by many names and is best visualized as a spectrum of solutions designed to take the redundancy out of data. At one end of the spectrum we find file formats such as JPG, GIF, MP3, MPG, GZIP, TAR and SIT. These are examples of intra-file deduplication (a.k.a. file compression). Further along the spectrum we find single instance storage, a method of inter-file dedupe that has existed in many business applications since at least the early-to-mid 90s, possibly earlier. It's a simple implementation that identifies whole [byte for byte] duplicate files and stores a single copy. A lightweight system of pointers or stubs ensures that applications are unaware of the underlying data reduction. As we continue to move along the spectrum we encounter even more efficient methods of deduplication such as data chunking (at the block or sub-file level) and delta encoding. And storage vendors have, in recent years, added a new wrinkle to deduplication: timing. Should we deduplicate before or after moving our data over the network from point A to point B?
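To make two points on that spectrum concrete, here is a minimal Python sketch of single instance storage (whole-file dedupe) next to chunk-level dedupe. It is an illustration under stated assumptions, not how any particular product works: the function names are hypothetical, chunks are fixed-size (commercial products may use variable-size chunking), the chunk size of 4 bytes is chosen only to keep the example readable, and SHA-256 digests stand in for whatever fingerprinting a real system uses.

```python
import hashlib


def single_instance_store(files):
    """Whole-file dedupe: keep one copy per distinct content, plus pointers."""
    store = {}     # content digest -> file bytes (stored once)
    pointers = {}  # file name -> content digest (the "stub")
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)
        pointers[name] = digest
    return store, pointers


def chunk_store(files, chunk_size=4):
    """Sub-file dedupe: split into fixed-size chunks, keep each chunk once."""
    store = {}    # chunk digest -> chunk bytes (stored once)
    recipes = {}  # file name -> ordered list of chunk digests
    for name, data in files.items():
        recipe = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)
            recipe.append(digest)
        recipes[name] = recipe
    return store, recipes


def restore(store, recipe):
    """Rebuild a file from its chunk recipe."""
    return b"".join(store[d] for d in recipe)


def stored_bytes(store):
    """Physical bytes actually kept in a store."""
    return sum(len(v) for v in store.values())


if __name__ == "__main__":
    files = {
        "a.txt": b"AAAABBBBCCCC",
        "copy_of_a.txt": b"AAAABBBBCCCC",  # exact duplicate
        "edited_a.txt": b"AAAABBBBDDDD",   # last third changed
    }
    logical = sum(len(d) for d in files.values())
    sis, _ = single_instance_store(files)
    chunks, recipes = chunk_store(files)
    print("logical bytes:", logical)
    print("single instance bytes:", stored_bytes(sis))
    print("chunked bytes:", stored_bytes(chunks))
    assert restore(chunks, recipes["edited_a.txt"]) == files["edited_a.txt"]
```

Whole-file dedupe only helps with the exact copy, while chunking also recovers the unchanged part of the edited file. The price is extra bookkeeping (a recipe per file), which is the complexity versus efficiency trade-off.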

Commercial implementations of deduplication typically combine multiple methods, and all of them make trade-offs between complexity, efficiency and performance. There is no single universally superior method or commercial implementation of deduplication. You guessed it - it all depends on what you're trying to accomplish.

And, it really doesn't matter whether we're talking about primary or secondary storage, old backup technology or new, near-line, off-line, local, remote, backup or archival storage. They can all benefit from deduplication whether it's embedded or bolted-on.

Deduplication isn't a patch, it's an integral part of efficient information and storage management.
wcpreston
What they said
written by W. Curtis Preston, July 09, 2009
I concur with most of what the last two posters said. I do believe it is both a strategy for all storage in the future, and it is also a finger in the dike to help us migrate to a more disk-based approach to backup.

I personally do not refer to dedupe as compression. It is in the generic sense, in that it shrinks the data, but I think calling it compression confuses people. If it worked like compression (the concept we've known for years), then a 10:1 dedupe ratio would mean that I could store a 10 TB database in 1 TB of disk, when what it really means is that I can store my 1 TB database 10 times in 1 TB of disk. And I'm constantly having to explain this to people, which is why I do not use the term.
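A rough model of that distinction, as a Python sketch. The assumptions are deliberately simple and are mine, not the commenter's: every backup is a full backup of the same data set, each backup after the first adds only the fraction of data that changed (`change_rate`), and redundancy inside a single backup is ignored. The function name `dedupe_ratio` is hypothetical.

```python
def dedupe_ratio(full_size_tb, backups, change_rate=0.0):
    """Logical data written divided by physical data stored.

    Assumes the first full backup is stored whole and each later full
    backup adds only the changed fraction of the data set.
    """
    logical = full_size_tb * backups
    physical = full_size_tb + full_size_tb * change_rate * (backups - 1)
    return logical / physical


if __name__ == "__main__":
    # Ten unchanged full backups of a 1 TB database fit in about 1 TB.
    print("1 TB x 10 backups:", dedupe_ratio(1, 10))
    # A single 10 TB database, backed up once, gains nothing in this model.
    print("10 TB x 1 backup:", dedupe_ratio(10, 1))
    # With some daily change the ratio drops below the number of copies.
    print("1 TB x 10 backups, 5% change:", round(dedupe_ratio(1, 10, 0.05), 2))
```

In this model the ratio comes from repetition across backups, not from shrinking one copy, which is why the ratio says little about how small a single large database will become.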
josephmartins
...
written by joseph martins, July 09, 2009
I would have to agree with you Curtis. I think part of the confusion lies in the conceptual gaps and overlap.

It certainly makes my job easier to use the term compression when discussing traditional intra-file data reduction techniques, and deduplication everywhere else.
ChrisFricke
...
written by Chris Fricke, July 09, 2009
I just read the whole conversation and now my head hurts. Thanks a lot!
