This paper investigates the design of large-scale distributed systems. It examines their
features and main design considerations, and presents the Netflix API, Cassandra, and Oracle
as examples of such systems. Moreover, the paper investigates the challenges of
designing, developing, deploying, and maintaining such systems with regard to
the features presented. Finally, the paper discusses available
solutions and current practices for addressing the challenges that large-scale distributed
systems face.