In deduplication, index-lookup disk bottleneck is a major obstacle which limits the throughput of backup processes. One way to minimize the effect of this issue and boost speed is to use very high course-grained chunks for deduplication at a cost of low storage saving and limited scalability. Another way is to distribute the deduplication process among multiple nodes but this approach introduces storage node island effect and also incurs high communication cost. In this paper, we explore dCACH, a content-aware clustered and hierarchical deduplication system, which implements a hybrid of inline course grained and offline fine-grained distributed deduplication where routing decisions are made for a set of files instead of single files. It utilizes bloom filters for detecting similarity between a data stream and previous data streams and performs stateful routing which solves the storage node island problem. Moreover, it exploits the negligibly small amount of content shared among chunks from different file types to create groups of files and deduplicate each group in their own fingerprint index space. It implements hierarchical deduplication to reduce the size of fingerprint indexes at the global level, where only files and big sized segments are deduplicated. Locality is created and exploited first using the big sized segments deduplicated at the global level and second by routing a set of consecutive files together to one storage node. Furthermore, the use of bloom filter for similarity detection between streams has low communication and computation cost while it enables to achieve duplicate elimination performance comparable to single node deduplication. dCACH is evaluated using a prototype deployed on a server environment distributed over four separate machines. It is shown to have 10× the speed of Extreme_Binn with a minimal communication overhead, while its duplicate elimination effectiveness is on a par with a single node deduplication system.
References
[1]
Xia, W., Jiang, H., Feng, D., Douglis, F., Shilane, P., Hua, Y., Fu, M., Zhang, Y. and Zhou, Y. (2016) A Comprehensive Study of the Past, Present, and Future of Data Deduplication. Proceedings of the IEEE, 104, 1681-1710.
https://doi.org/10.1109/JPROC.2016.2571298
[2]
Zhu, B., Li, K. and Patterson, H. (2008) Avoiding the Disk Bottleneck in the Data Domain Deduplication File System. In: Proceedings of the 6th USENIX Conference on File and Storage Technologies, USENIX Association, Berkeley, 18:1-18:14.
[3]
Meister, D. and Brinkmann, A. (2009) Multi-Level Comparison of Data Deduplication in a Backup Scenario. Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, Haifa, Article No. 8. https://doi.org/10.1145/1534530.1534541
[4]
Quinlan, S. and Dorward, S. (2002) Venti: A New Approach to Archival Data Storage. Proceedings of the 1st USENIX Conference on File and Storage Technologies, USENIX Association, Berkeley, 89-101.
[5]
Fu, Y., Jiang, H., Xiao, N., Tian, L. and Liu, F. (2011) AA-Dedupe: An Application-Aware Source Deduplication Approach for Cloud Backup Services in the Personal Computing Environment. IEEE International Conference on Cluster Computing, Austin, 26-30 September 2011, 112-120.
https://doi.org/10.1109/CLUSTER.2011.20
[6]
Liu, C., Lu, Y., Shi, C., Lu, G., Du, D.H.C. and Wang, D.-S. (2008) ADMAD: Application-Driven Metadata Aware De-Duplication Archival Storage System. 5th IEEE International Workshop on Storage Network Architecture and Parallel I/Os, Computer Society Press, Baltimore, 29-35. https://doi.org/10.1109/SNAPI.2008.11
[7]
Wei, J., Jiang, H., Zhou, K. and Feng, D. (2010) MAD2: A Scalable High-Throughput Exact Deduplication Approach for Network Backup Services. IEEE 26th Symposium on Mass Storage Systems and Technologies, Lake Tahoe, 6-7 May 2010, 1-14. https://doi.org/10.1109/MSST.2010.5496987
[8]
Bhagwat, D., Eshghi, K., Long, D. and Lillibridge, M. (2009) Extreme Binning: Scalable, Parallel Deduplication for Chunk-Based File Backup. IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, London, 21-23 September 2009, 1-9.
https://doi.org/10.1109/MASCOT.2009.5366623
[9]
Lillibridge, M., Eshghi, K., Bhagwat, D., Deolalikar, V., Trezise, G. and Camble, P. (2009) Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality. Proceedings of the 7th Conference on File and Storage Technologies, USENIX Association, Berkeley, 111-123.
[10]
Dubnicki, C., Gryz, L., Heldt, L., Kaczmarczyk, M., Kilian, W., Strzelczak, P., Szczepkowski, J., Un-gureanu, C. and Welnicki, M. (2009) HYDRAstor: A Scalable Secondary Storage. Proceedings of the 7th Conference on File and Storage Technologies, USENIX Association, Berkeley, 197-210.
[11]
Dong, W., Douglis, F., Li, K., Patterson, H., Reddy, S. and Shilane, P. (2011) Tradeoffs in Scalable Data Routing for Deduplication Clusters. Proceedings of the 9th USENIX Conference on File and Storage Technologies, USENIX Association, Berkeley, 15-29.
[12]
Fu, Y., Jiang, H. and Xiao, N. (2012) A Scalable Inline Cluster Deduplication Framework for Big Data Protection. ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing, Montreal, 3-7 December 2012, 354-373. https://doi.org/10.1007/978-3-642-35170-9_18
[13]
Dagnaw, G., Hua, W. and Zhou, K. (2018) CACH-Dedup: Content Aware Clustered and Hierarchical Deduplication. 24th IEEE International Conference on Parallel and Distributed Systems, Singapore, 11-13 December 2018, 399-407.
https://doi.org/10.1109/PADSW.2018.8644884
[14]
Agrawal, N., Bolosky, W.J., Douceur, J.R. and Lorch, J.R. (2007) A Five-Year Study of File-System Metadata. ACM Transactions on Storage, 3, 9.
https://doi.org/10.1145/1288783.1288788
[15]
Fu, Y., Jiang, H., Xiao, N., Tian, L., Liu, F. and Xu, L. (2014) Application-Aware Local-Global Source Deduplication for Cloud Backup Services of Personal Storage. IEEE Transactions on Parallel and Distributed Systems, 25, 1155-1165.
https://doi.org/10.1109/TPDS.2013.167
[16]
Tan, Y., Jiang, H., Feng, D., Tian, L., Yan, Z. and Zhou, G. (2010) Sam: A Semantic-Aware Multi-Tiered Source De-Duplication Framework for Cloud Backup. 39th International Conference on Parallel Processing, San Diego, 13-16 September 2010, 614-623. https://doi.org/10.1109/ICPP.2010.69
[17]
Xia, W., Jiang, H., Feng, D. and Hua, Y. (2011) Silo: A Similarity-Locality Based Near-Exact Deduplication Scheme with Low Ram Overhead and High Throughput. Proceedings of USENIX ATC, Portland, 15 June 2011, 26-28.
[18]
Li, Y.-K., Xu, M., Ng, C.-H. and Lee, P.P.C. (2014) Efficient Hybrid Inline and Out-of-Line Deduplication for Backup Storage. ACM Transactions on Storage, 11, 1-21. https://doi.org/10.1145/2641572
[19]
Yang, T., Jiang, H., Feng, D., Niu, Z., Zhou, K. and Wan, Y. (2010) DEBAR: A Scalable High-Performance Deduplication Storage System for Backup and Archiving. IEEE International Symposium on Parallel & Distributed Processing, Istanbul, 7-9 July 2010, 1-12. https://doi.org/10.1109/IPDPS.2010.5470468
[20]
Zhang, J., Zhang, S., Lu, Y., Zhang, X. and Wu, S. (2013) Hierarchical Data Deduplication Technology Based on Bloom Filter Array. Lecture Notes in Electrical Engineering, Proceedings of the International Conference on Information Engineering and Applications, 1, 725-732.
https://doi.org/10.1007/978-1-4471-4856-2_88
[21]
Zhou, Y., Feng, D., Xia, W., Fu, M., Huang, F., Zhang, Y. and Li, C. (2015) SecDep: A User-Aware Efficient Fine-Grained Secure Deduplication Scheme with Multi-Level Key Management. 31st Symposium on Mass Storage Systems and Technologies, Santa Clara, 1-5 June 2015, 1-14.
https://doi.org/10.1109/MSST.2015.7208297
[22]
Fu, Y., Xiao, N., Jiang, H., Hu, G. and Chen, W. (2017) Application-Aware Big Data Deduplication in Cloud Environment. IEEE Transactions on Cloud Computing, 1.
https://doi.org/10.1109/TCC.2017.2710043
[23]
Luo, S., Zhang, G., Wu, C., Khan, S. and Li, K. (2015) Boafft: Distributed Deduplication for Big Data Storage in the Cloud. IEEE Transactions on Cloud Computing, 1. https://doi.org/10.1109/TCC.2015.2511752
[24]
Kulkarni, P., Douglis, F., LaVoie, J. and Tracey, J.M. (2004) Redundancy Elimination within Large Collections of Files. Proceedings of the Annual Conference on USENIX Annual Technical Conference, USENIX Association, Berkeley, 59-72.
[25]
Muthitacharoen, A., Chen, B., et al. (2001) A Low-Bandwidth Network File System. Proceedings of the 18th ACM Symposium on Operating Systems Principles, Banff, 21-24 October 2001, 174-187. https://doi.org/10.1145/502034.502052
[26]
Policroniades, C. and Pratt, I. (2004) Alternatives for Detecting Redundancy in Storage Systems Data. In: Proceedings of the Annual Conference on USENIX Annual Technical Conference, USENIX Association, Berkeley, 73-86.
[27]
Bloom, B.H. (1970) Space/Time Trade-Offs in Hash Coding with Allowable Errors. Communications of the ACM, 13, 422-426. https://doi.org/10.1145/362686.362692
[28]
Lu, G., Debnath, B. and Du, D.H. (2011) A Forest-Structured Bloom Filter with Ash Memory. IEEE 27th Symposium on Mass Storage Systems and Technologies, Denver, 23-27 May 2011, 1-6.
[29]
Wang, J., Zhao, Z., Xu, Z., Zhang, H., Li, L. and Guo, Y. (2015) I-Sieve: An Inline High Performance Deduplication System Used in Cloud Storage. Tsinghua Science and Technology, 20, 17-27. https://doi.org/10.1109/TST.2015.7040510
[30]
Lin, C., Cao, Q., Huang, J., Yao, J., Li, X. and Xie, C. (2018) HPDV: A Highly Parallel Deduplication Cluster for Virtual Machine Images. 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, Washington DC, 1-4 May 2018, 472-481. https://doi.org/10.1109/CCGRID.2018.00074
[31]
Rabin, M.O. (1981) Fingerprinting by Random Polynomials. Tech. Rep., Center of Research in Computer Technology, Technical Report.
[32]
Rivest, R. (1992) The md5 Message-Digest Algorithm.
https://doi.org/10.17487/rfc1321
[33]
Broder, A.Z., Charikar, M., Frieze, A.M. and Mitzenmacher, M. (1998) Min-Wise Independent Permutations. Journal of Computer and System Sciences, 60, 327-336.
[34]
Broder, A. and Mitzenmacher, M. (2004) Network Applications of Bloom Filters: A Survey. Internet Mathematics, 1, 485-509.
https://doi.org/10.1080/15427951.2004.10129096
[35]
Donnet, B., Gueye, B. and Kaafar, M.A. (2012) Path Similarity Evaluation Using Bloom Filters. Computer Networks, 56, 858-869.
https://doi.org/10.1016/j.comnet.2011.11.003
[36]
FSL (2014) Traces and Snapshots Public Archive. http://tracer._lesystems.org
[37]
Xia, W., Jiang, H., Feng, D. and Tian, L. (2016) DARE: A Deduplication-Aware Resemblance Detection and Elimination Scheme for Data Reduction with Low Overheads. IEEE Transactions on Computers, 65, 1692-1705.
https://doi.org/10.1109/TC.2015.2456015
[38]
Zhang, P., Huang, P., He, X., Wang, H., Yan, L. and Zhou, K. (2016) RMD: A Resemblance and Mergence Based Approach for High Performance Deduplication. 45th International Conference on Parallel Processing, Philadelphia, 16-19 August 2016, 536-541. https://doi.org/10.1109/ICPP.2016.68
[39]
Fu, M., Feng, D., Hua, Y., He, X., Chen, Z., Xia, W., Huang, F. and Liu, Q. (2014) Accelerating Restore and Garbage Collection in Deduplication-Based Backup Systems via Exploiting Historical Information. USENIX Annual Technical Conference, Philadelphia, 19-20 June 2014, 181-192.
[40]
Tan, Y., Wang, B., Wen, J., Yan, Z., Jiang, H. and Srisa-an, W. (2018) Improving Restore Performance in Deduplication-Based Backup Systems via a Fine-Grained Defragmentation Approach. IEEE Transactions on Parallel and Distributed Systems, 29, 2254-2267. https://doi.org/10.1109/TPDS.2018.2828842
[41]
Wu, J., Hua, Y., Zuo, P. and Sun, Y. (2018) Improving Restore Performance in Deduplication Systems via a Cost-Efficient Rewriting Scheme. IEEE Transactions on Parallel and Distributed Systems, 30, 119-132.
https://doi.org/10.1109/TPDS.2018.2852642
[42]
Cao, Z., Wen, H., Wu, F. and Du, D.H. (2018) ALACC: Accelerating Restore Performance of Data Deduplication Systems Using Adaptive Look-Ahead Window Assisted Chunk Caching. 16th fUSENIXg Conference on File and Storage Technologies, Oakland, 12-15 February 2018, 309-324.
[43]
Kaczmarczyk, M. and Dubnicki, C. (2015) Reducing Fragmentation Impact with Forward Knowledge in Backup Systems with Deduplication. In: Proceedings of the 8th ACM International Systems and Storage Conference, ACM, New York, 17.
https://doi.org/10.1145/2757667.2757678