The impact on analytics and data science of the trend towards adopting blockchain technology for data processing
Introduction
Every day we are creating more and more data. It is estimated that by 2025, 463 exabytes (1 EB = 1,000^6 bytes) of data will be created globally each day (Desjardins 2019). In an increasingly competitive marketplace, businesses need analytics and data science to optimize performance, uncover hidden patterns, and make better business decisions. Good quality data is accurate, complete, conformant, consistent, timely, unique, and valid.
Blockchain, a technology built on distributed transaction processing and encryption, was developed to support the launch of Bitcoin in 2009. Since then, cryptocurrencies have proliferated. Other applications of blockchain have also grown, such as smart contracts and supply chain management, as well as innovations in the fintech, healthcare, retail, and real estate industries (Shah, Forester, Berberich, & Raspé, 2019). The features of blockchain apply to any industry, and they specifically address the concerns about data correctness and security that limit the use and sharing of big data (van Rijmenam, 2019).
Data science is about making predictions from a large amount of data and transforming the nature of transactions. Blockchain is about recording and validating data. It is changing data management. “If Big Data is the quantity, blockchain is the quality” (De Meijer, 2019 p11).
This literature review will discuss the limitations and pitfalls of blockchain for data processing. In particular, it will look at data privacy, immutability, security, and scalability. It covers literature (i.e. journals, articles, weblogs) written between 2016 and 2019.
As Engin & Treleaven (2019 p456) state, the “technology is still at its infancy, especially with regards to security and privacy. The lack of standards, scalability, storage, access, change management and security against cybercriminals can be mentioned as some of the key areas of concern.”
Data Privacy
Over the past decade or so, concern around data privacy has increased. Whilst data analytics can discover hidden insights and help organisations deliver better products and services, it can also have a significant impact on personal privacy by collating data from a variety of sources, generating new information, and retaining data for longer than usual. Changes to data privacy regulation have the potential to affect blockchain technology users. The European Union takes an all-in approach with the General Data Protection Regulation (GDPR). It defines personal data to include any information relating to an identified or identifiable individual. It also applies its standards to anyone who offers goods or services to its residents (Shah, et al. 2019).
In contrast, in the United States data privacy regulation at the federal level is sector-specific, for example, the Gramm-Leach-Bliley Act (GLBA) for financial institutions, and the Health Insurance Portability and Accountability Act of 1996 (HIPAA) for the health industry. Meanwhile, some states, such as California, have broader legislation, for example, the California Consumer Privacy Act of 2018 (CCPA). The CCPA defines personal information as that which “identifies, relates to, describes, is reasonably capable of being associated with, or could reasonably be linked, directly or indirectly, with a particular consumer or household”.
These differing regimes make it challenging for blockchain users to meet legal requirements such as consent, data access, rectification, data portability, and data removal (“the right to be forgotten”) (Shah, et al. 2019).
According to Munn, Hristova & Magee (2019) the distributed character of blockchain poses challenges. Users are not exposed to threats in a uniform way as different blockchain technologies produce different inflections of privacy. The inability to remove or amend records may violate privacy regulations. The ledger is open to exploitation to re-identify addresses on the transactions.
As Primavera De Filippi (2016, p1) states, “anyone can retrieve the history of all transactions performed on a blockchain and rely on big data analytics in order to retrieve potentially sensitive information”. This viewpoint is also supported by Ishmaev (2019) who argues that “it is crucial to consider not only control over access to the private data and technological artifacts [sic], but also privacy invasions based on the inferred information”.
Mirchandani (2019) argues that whilst permissioned blockchain may align with the GDPR’s goals of accuracy and transparency, as the regulation is written it is likely that blockchain technology violates it. She goes on to say that there are promising developments in permissioned blockchains that allow users to have greater control over who has access to their data.
Immutability
One of the touted advantages of blockchain technology is that the blockchain is immutable, i.e. once written it cannot be amended, rather like carving in stone. Blockchains are an “append-only” system. Immutability is achieved by including in each block a hash or digital signature of its own contents together with the hash of the previous block, so that altering an earlier block invalidates every block that follows it. It has the potential to transform the auditing process and bring more trust and integrity to the data that organisations are using. As van Rijmenam (2019) states, organisations that want to apply blockchain will need to ensure that their big data is correct and of the highest standards. If done correctly, blockchain could be a catalyst for better data, resulting in better insights.
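The hash-linking idea described above can be illustrated with a minimal sketch. It is not the implementation of any particular blockchain: the block layout, the function names (block_hash, build_chain, is_valid) and the sample records are all hypothetical, and real systems add consensus, signatures and Merkle trees on top of this basic structure.

```python
import hashlib
import json


def block_hash(index, data, prev_hash):
    """Return the SHA-256 hex digest of a block's contents and its predecessor's hash."""
    payload = json.dumps(
        {"index": index, "data": data, "prev_hash": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(records):
    """Build an append-only list of blocks, each linked to the one before it."""
    chain = []
    prev_hash = "0" * 64  # assumed placeholder predecessor for the first block
    for index, data in enumerate(records):
        digest = block_hash(index, data, prev_hash)
        chain.append(
            {"index": index, "data": data, "prev_hash": prev_hash, "hash": digest}
        )
        prev_hash = digest
    return chain


def is_valid(chain):
    """Recompute every hash and check each link to the previous block."""
    prev_hash = "0" * 64
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        expected = block_hash(block["index"], block["data"], block["prev_hash"])
        if block["hash"] != expected:
            return False
        prev_hash = block["hash"]
    return True


chain = build_chain(["pay A 5", "pay B 7", "pay C 2"])  # hypothetical records
print(is_valid(chain))          # the untouched chain verifies
chain[1]["data"] = "pay B 700"  # tamper with an earlier record
print(is_valid(chain))          # the stored hash no longer matches the contents
```

Because each block stores the hash of its predecessor, quietly "fixing" one record would require recomputing every later block, which is what makes correction of genuine errors so awkward.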
Data science service provider Flatworld Solutions (n.d.) argues that it is easy to get sucked into the hype of blockchain and that the technology is not the most suitable for organisations where transactional malleability and centralisation are extremely important, e.g. organisations heavily reliant on data entry.
As Lamb, Treat, & Jelf (2016 p4) point out, even the smartest of contracts are susceptible to human error. They cite an example where hackers stole more than $60 million of the digital currency ether from The DAO, a high-profile start-up fund. This was possibly due to an error in the programming of the smart contract code. “Fat finger” moments are a particular concern in the financial services industry, and for that industry to embrace blockchain technology for enterprise and permissioned networks, it cannot be a technology in which human errors are immutable. Take, for example, the news story of the ‘accidental’ millionaires in New Zealand. In 2009, Westpac bank accidentally gave a gas station owner, Hui “Leo” Gao, access to A$7 million (100 times the intended amount). He, with his girlfriend, withdrew the money and fled the country (CNN, 2012).
The rights of data correction and data removal (i.e. “the right to be forgotten”) present the most apparent conflict with the transaction immutability characteristics of blockchain technology (Shah, et al. 2019).
Security and User Collusion
Loss of control of a private key creates problems, whether the owner misplaces the device that stores the key or an attacker gains access to it and transfers the asset to another private key under the attacker's control. If a hard drive fails or users forget their key, they are effectively locked out of the resource and any further transactions with the asset are inhibited (Jaikaran, 2018).
Jaikaran also argues that groups of users may combine computing resources and collude to mine blocks, thereby wielding influence over which transactions are appended to the block and the blocks that are posted. This is known as a 51% attack. A 51% attack is defined by Investopedia (2019) as “an attack on a blockchain – most commonly bitcoins…- by a group of miners controlling more than 50% of the network's mining hash rate or computing power”. According to Cangemi & Brennan (2019), a 51% attack on a blockchain ledger is possible even if improbable. They maintain that a properly designed blockchain is more secure than a traditional central database, as public and private keys replace passwords, the consensus mechanism replaces change management, and a complete distributed ledger replaces archiving.
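Why control of the majority of hash rate matters can be sketched numerically. The example below is not drawn from the literature reviewed here: it assumes the catch-up model set out in the original Bitcoin white paper (Nakamoto, 2008), in which an attacker holding a share q of the hash rate tries to overtake an honest chain that is z blocks ahead. The function name and the sample values are illustrative only.

```python
import math


def attacker_success_probability(q, z):
    """Probability that an attacker with hash-rate share q ever overtakes
    an honest chain that is z blocks ahead (assumed Nakamoto 2008 model)."""
    if q >= 0.5:
        return 1.0  # under this model a majority attacker always catches up
    p = 1.0 - q                # honest share of the hash rate
    lam = z * (q / p)          # expected attacker progress (Poisson mean)
    total = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        total -= poisson * (1.0 - (q / p) ** (z - k))
    return total


# Illustrative shares of the network hash rate, six blocks behind
for share in (0.10, 0.30, 0.45, 0.51):
    print(share, attacker_success_probability(share, 6))
```

Under these assumptions the chance of rewriting history falls away quickly for a minority attacker as blocks accumulate, but becomes a certainty once the attacker's share reaches one half, which is consistent with the attack being described as possible even if improbable.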
Another challenge is quantum computing.
“Quantum computing takes advantage of the strange ability of subatomic particles to exist in more than one state at any time. Due to the way the tiniest of particles behave, operations can be done much more quickly and use less energy than classical computers” (Beall & Reynolds, 2018, p3).
Presently, it is assumed that the inversion of hashes is computationally difficult. Quantum computers could compromise the authenticity of the blockchain entries should inversion become possible. Quantum computers could also break the encryption of the blockchain public/private key (Rodenburg & Pappas, 2017).
Scalability and Sustainability
Organisations are looking to data analytics and data science to reduce costs and make money.
Compared to traditional storage, data storage on a blockchain is expensive (Sarikaya 2019). “Limited transaction throughput and storage are widely-known problems of blockchain technology” (Vo, Mohania, Verma, & Mehedy 2018 p20). As blockchains get bigger they take up more storage and thus become slower. In 2018 Ethereum’s blockchain surpassed 1TB. This is a problem not just for storage on the nodes but for the network as well, because larger blockchains take longer to copy to new nodes on the network. Due to their fixed block size, blockchains do not scale well with high transaction volumes (Tabora 2018). On the Bitcoin blockchain, it takes about ten minutes to add a new block. This translates to about seven transactions per second, compared with the legacy brand Visa, which can process 24,000 transactions per second (Investopedia, 2020). As Wheen (2018) points out, while memory and bandwidth costs are coming down fast, scalability will become a more serious problem if not addressed.
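The throughput figure can be reproduced with simple arithmetic. In the sketch below, the ten-minute block interval and the Visa figure come from the text above, while the 1 MB block size and the 250-byte average transaction size are assumptions chosen for illustration, not values taken from the sources reviewed.

```python
BLOCK_SIZE_BYTES = 1_000_000      # assumption: 1 MB block size limit
AVG_TRANSACTION_BYTES = 250       # assumption: typical transaction size
BLOCK_INTERVAL_SECONDS = 10 * 60  # from the text: about ten minutes per block
VISA_TPS = 24_000                 # from the text (Investopedia, 2020)


def transactions_per_second(block_size, tx_size, interval):
    """Whole transactions that fit in one block, divided by the block interval."""
    return (block_size // tx_size) / interval


tps = transactions_per_second(
    BLOCK_SIZE_BYTES, AVG_TRANSACTION_BYTES, BLOCK_INTERVAL_SECONDS
)
print(round(tps, 1))           # roughly seven transactions per second
print(round(VISA_TPS / tps))   # how many times higher the Visa figure is
```

With a fixed block size and a fixed interval, throughput has a hard ceiling regardless of demand, which is why the literature treats scalability as a design constraint rather than a tuning problem.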
Akcora, Dixon, Gel, & Kantarcioglu (2019) posit that a criticism of blockchain is that data querying is time consuming as data blocks are written into files on disk. There have been developments in blockchain query languages but their use to date is limited.
Energy consumption is a significant concern (Wheen, 2018). Just how much energy is required? The “proof of work” system consumes vast amounts of computational power (Investopedia, 2020). In 2017, Malmo reported that each bitcoin transfer used enough energy to run a comfortable American household for nearly a week. Put another way, bitcoin miners worldwide use enough electricity at any given time to power approximately 2.26 million American homes. Research company Elite Fixtures found that the cost of mining a single bitcoin can vary drastically by location, from $531 to a whopping $26,170 (Investopedia, 2020). As Johnson (2018, p22) points out, “this creates a serious question in terms of the environmental and sustainability impacts of blockchain technology”.
To be useful in data processing, blockchain needs to be prepared for the billions of transactions it must support and log. If blockchain is a lot more expensive than a less ideal alternative the uptake of blockchain will be slow and its success not guaranteed.
Conclusion
Dirty data, or erroneous information, has always been a concern of data analytics and science. The fact that blockchain offers a record of consensus and an audit trail that can be maintained and validated indicates that the technology could be beneficial in both public and private sector applications (Treleaven, Brown, & Yang, 2017).
The major concern around data privacy is ensuring that the legal requirements of the various jurisdictions are met. The issues of consent, data access, rectification, data portability, and data removal need to be addressed.
Immutability is a double-edged sword. Whilst blockchain technology engenders trust and authenticity, it is vulnerable to any errors, whether intentional or not. Quantum computing could also have a serious impact on the security of blockchain technology. I think this is where there is a major pitfall of blockchain and its impact on analytics and data science for data processing. No matter how secure we think a system is, there is always someone out there willing to put their hacker skills to the test, and not always for the benefit of anyone but themselves.
At the moment blockchain appears to be expensive, especially in terms of its environmental impact. Whilst storage costs continue to fall, given the amount of data that is being created each day, scalability and sustainability of blockchain will be a concern.
In general, the literature reviewed considered blockchain technology to be beneficial and reported that developments are underway to solve its limitations and pitfalls. Some people, such as De Meijer (2019), believe that big data and blockchain are a ‘great marriage’. Time will tell.
References
Akcora, C.G., Dixon, M.F., Gel, Y.R., & Kantarcioglu, M. (2019). Blockchain Data Analytics. Journal of IEEE Intelligent Informatics, 20(1).
http://math.iit.edu/~mdixon7/block_chain_analytics.pdf
Beall, A. & Reynolds, M. (2018, February 16). What are quantum computers and how do they work? WIRED explains. Wired. Retrieved 26 February, 2020 from https://www.wired.co.uk/article/quantum-computing-explained
California Consumer Privacy Act 2018. Retrieved 22 February, 2020 from
https://leginfo.legislature.ca.gov/faces/codes_displaySection.xhtml?lawCode=CIV&sectionNum=1798.140.
Cangemi, M.P. & Brennan, G. (2019). Blockchain auditing – Accelerating the need for automated audits! EDPACS, 59(4), 1-11, doi: 10.1080/07366981.2019.1615176
CNN. (2012, August 24). ‘Accidental’ millionaire's spending spree ends in prison. Retrieved 22 February, 2020 from
De Filippi, P. (2016). The interplay between decentralization and privacy: The case of blockchain technologies. Journal of Peer Production 9. Retrieved 22 February, 2020 from
De Meijer, C.R.W. (2019, January 29) Blockchain and big Data: A great marriage. Finextra. Retrieved 17 February, 2020 from
https://www.finextra.com/blogposting/16596/blockchain-and-big-data-a-great-mariage
Desjardins, J. (2019, April 17). How much data is generated each day? World Economic Forum. Retrieved 24 February, 2020 from
https://www.weforum.org/agenda/2019/04/how-much-data-is-generated-each-day-cf4bddf29f/
Engin, E. & Treleaven, P. (2019). Algorithmic government: Automating public services and supporting civil servants in using data science technologies. The Computer Journal, 62(3), 448–460.
https://academic-oup-com.elibrary.jcu.edu.au/comjnl/article/62/3/448/5070384
Flatworld Solutions (n.d.) Big data and blockchain analytics – Is that a perfect match? Retrieved 17 February, 2020 from
Investopedia. (2019, May 6). 51% Attack. Retrieved 6 February, 2020 from
https://www.investopedia.com/terms/1/51-attack.asp
Investopedia. (2020, February 1). Blockchain Explained. Retrieved 22 February, 2020 from
https://www.investopedia.com/terms/b/blockchain.asp
Ishmaev, G. (2019) The ethical limits of blockchain-enabled markets for private IoT data. Philosophy & Technology https://doi.org/10.1007/s13347-019-00361-y
Jaikaran, C. (2018). Blockchain: Background and Policy Issues. Congressional Research Service. https://fas.org/sgp/crs/misc/R45116.pdf
Johnson, K.D. (2018) Blockchain Technology: Implications for Development. Risk Innovation Lab. Arizona: Arizona State University.
Lamb, R., Treat, D., Jelf, O. (2016). Editing the uneditable blockchain: Why distributed ledger technology must adapt to an imperfect world. Accenture. Retrieved 7 February, 2020 from
https://www.accenture.com/_acnmedia/pdf-33/accenture-editing-uneditable-blockchain.pdf
Malmo, C. (2017, November 2) One Bitcoin Transaction Consumes As Much Energy As Your House Uses in a Week. Vice. Retrieved 22 February, 2020 from
Mirchandani, A. (2019). The GDPR-blockchain paradox: Exempting permissioned blockchains from the GDPR. Fordham Intellectual Property, Media and Entertainment Law Journal, 29(4), 1199-1241.
https://ir.lawnet.fordham.edu/cgi/viewcontent.cgi?article=1730&context=iplj
Munn, L., Hristova, T., & Magee, L. (2019) Clouded data: Privacy and the promise of encryption. Big Data & Society, January – June 2019, 1-16 https://journals-sagepub-com.elibrary.jcu.edu.au/doi/pdf/10.1177/2053951719848781
Rodenburg, B. & Pappas, S.P. (2017). Blockchain and quantum computing. The Mitre Corporation.
Sarikaya, S. (2019, January 6). How Blockchain Will Disrupt Data Science: 5 Blockchain Use Cases in Big Data. Towards Data Science. Retrieved 7 February, 2020 from
Shah, P., Forester, D., Berberich, M., & Raspé, C. (2019). Blockchain Technology: Data Privacy Issues and Potential Mitigation Strategies. Thomson Reuters.
Tabora, V. (2018, August 4). Databases and Blockchains, The Difference Is In Their Purpose And Design. Hackernoon. Retrieved 7 February, 2020 from
Treleaven, P., Brown, R.G., & Yang, D. (2017, September 22) Blockchain technology in finance, Computer, 50(9), 14-17. DOI: 10.1109/MC.2017.3571047
van Rijmenam, M. (2019) How Blockchain Will Improve Your Big Data.
DataSeries. Retrieved 17 February, 2020 from
https://medium.com/dataseries/why-blockchain-will-improve-your-big-data-4ddbd37676a0
Vo H.T., Mohania M., Verma D., Mehedy L. (2018) Blockchain-Powered Big Data Analytics Platform. In: Mondal A., Gupta H., Srivastava J., Reddy P., Somayajulu D. (eds) Big Data Analytics. BDA 2018. Lecture Notes in Computer Science, vol 11297. Springer, Cham. Retrieved 7 February, 2020 from
https://link-springer-com.elibrary.jcu.edu.au/chapter/10.1007/978-3-030-04780-1_2#citeas
Wheen, A. (2018, May 02) Blockchain is on the rise, so let’s deal with the pitfalls before they damage our industry. Infrastructure Intelligence. Retrieved 17 February, 2020 from
http://www.infrastructure-intelligence.com/article/may-2018/rise-blockchain-and-dealing-pitfalls