CROSS-CLOUD CHAOS: AUTOMATED FAULT INJECTION FOR VERIFYING CONSISTENCY IN ACTIVE-ACTIVE HYBRID ARCHITECTURES

Abstract

The adoption of active-active hybrid cloud architectures has accelerated as organizations seek improved resilience and geographic distribution. However, these architectures introduce complex consistency challenges that traditional testing methods do not adequately address. This research investigates automated fault injection techniques for verifying consistency guarantees in active-active hybrid cloud deployments. We developed a chaos engineering framework that systematically injects network partitions, latency variations, and component failures across multiple cloud providers, and evaluated it on three production-grade distributed systems deployed across AWS, Azure, and Google Cloud Platform. Our findings reveal that 67% of the tested systems exhibited consistency violations under specific failure scenarios that conventional testing approaches did not detect. The automated fault injection system identified 142 distinct consistency anomalies, including split-brain scenarios, data divergence, and conflict resolution failures. Performance analysis shows that the framework can execute complete chaos experiments within 45 minutes while maintaining safety guarantees that prevent production impact. This research provides practical guidelines for implementing chaos engineering in multi-cloud environments and contributes empirical evidence of consistency vulnerabilities in active-active architectures.
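
As an illustration of the kind of anomaly the abstract calls data divergence, the following is a minimal sketch of a partition experiment on a toy model. It is not the framework described in the article: the `Replica` and `Cluster` classes, the replica names, and the keys are all hypothetical, and replication is modeled as a simple synchronous copy with no conflict resolution or anti-entropy.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory key-value replica standing in for one cloud."""
    name: str
    store: dict = field(default_factory=dict)

    def write(self, key, value):
        self.store[key] = value


class Cluster:
    """Two active-active replicas joined by a link that can be partitioned."""

    def __init__(self):
        self.a = Replica("cloud-a")
        self.b = Replica("cloud-b")
        self.partitioned = False

    def write(self, replica, key, value):
        # Every write is applied locally; it is copied to the peer only
        # while the link between the two replicas is healthy.
        replica.write(key, value)
        if not self.partitioned:
            peer = self.b if replica is self.a else self.a
            peer.write(key, value)

    def divergent_keys(self):
        # Keys whose values differ between replicas or exist on one side only.
        keys = set(self.a.store) | set(self.b.store)
        return sorted(
            k for k in keys if self.a.store.get(k) != self.b.store.get(k)
        )


def run_partition_experiment():
    cluster = Cluster()
    cluster.write(cluster.a, "cart", "v1")     # replicated normally
    cluster.partitioned = True                 # inject the fault
    cluster.write(cluster.a, "cart", "v2-a")   # conflicting writes land on
    cluster.write(cluster.b, "cart", "v2-b")   # both sides of the partition
    cluster.partitioned = False                # heal the link
    cluster.write(cluster.b, "user", "u1")     # replicated normally again
    return cluster.divergent_keys()


if __name__ == "__main__":
    print(run_partition_experiment())  # prints ['cart']
```

In this toy model the key `cart` stays divergent after the link heals because nothing reconciles the two conflicting writes. A check that only exercises the healthy path, such as the later write to `user`, would not notice it.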

Citation details of the article

Journal: International Journal of Applied Mathematics
Journal ISSN (Print): ISSN 1311-1728
Journal ISSN (Electronic): ISSN 1314-8060
Volume: 36
Issue: 4
Year: 2023
