So many top buzzwords are wrapped into a single title because on Thursday I am moderating a panel session on exactly this topic at the Fujitsu Technology Forum 2013, with some members of the Big Data Working Group of the Cloud Security Alliance.
The Cloud Security Alliance is “a member-driven organization, chartered with promoting the use of best practices for providing security assurance within Cloud Computing.” Formed in 2009, it has grown to 35,000 members, 60 local chapters, and 20 main initiatives, four of which call themselves working groups.
The Big Data Working Group (BDWG) was formed in 2012 “to provide industry with leadership, research and guidance in identifying scalable techniques for data centric security and privacy problems.” Fujitsu Labs played a key role, along with eBay and Verizon, in the founding of this group.
The primary concern is that many of the normal security and privacy issues are magnified by the nature of big data. When data is in a constant stream to or from a data store, the standard approaches designed for finite, static stores do not work. The top 10 challenges identified are:
- Secure computations in distributed programming frameworks – Consider MapReduce, which spreads a data query across many individual computers. What must be done to assure that reducers are talking to the right and proper mappers? How can you assure that the final result does not expose values that were needed only as inputs to the query?
- Security best practices for non-relational data stores – While SQL security risks are well known, NoSQL is so new that there might be new risks that have not been discovered.
- Secure data storage and transactions logs – The sheer size of the data involved requires tiering, and moving data around manually on a regular basis (e.g., one system fills up, so you split the data across two and have to move half of it). How do you assure security before, during, and after the move in a consistent way?
- End-point input validation/filtering – So much data is flowing that it can be a challenge just to assure that you are really connected to a legitimate stream of data. How would you detect being connected to the wrong stream?
- Real-time security/compliance monitoring – Any kind of security or compliance monitoring is made more difficult by the volume of data. Luckily, the point of a big data system is to provide very fast calculations on that data. This might save the day, if you can restructure those monitoring operations into a form that can run in a distributed fashion.
- Scalable and composable privacy-preserving data mining and analytics – The potential for privacy invasion is multiplied by the volume of data, and exponentially so when it comes to making sure that data cannot be gathered from different places and combined to reveal private information.
- Cryptographically enforced access control and secure communication – Encryption is normally an all-or-nothing deal. On a single, closed system that is fine, but the situation changes when you have many machines working in parallel. Do all the cooperating systems use the same cryptographic key, which increases the risk of exposure, or do they use different ones, which can increase the complexity of interaction?
- Granular access control – Fine-grained access control is important so that parts of the data can be made available individually, rather than being grouped in one big lump. Management and control problems are magnified.
- Granular audits – Regular detailed audits are similarly challenged by the size of the data set, and by the complexity of how the data has to be handled and moved.
- Data provenance – Tracking when a record is created and updated is already a challenge when the data store is static, and it becomes considerably harder when large data sets are distributed across many computers.
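To make the end-point validation and key-sharing questions above a little more concrete, here is a minimal Python sketch. It is only one possible approach, not something taken from the working group's material: each stream gets its own key derived from a master secret, and every record carries an HMAC tag so a consumer can notice when it is reading from the wrong (or a tampered) stream. The function names, the stream names, and the master secret are all hypothetical, and a real deployment would also need key distribution, rotation, and replay protection.

```python
import hashlib
import hmac
import json


def derive_stream_key(master_key: bytes, stream_id: str) -> bytes:
    """Derive a per-stream key so that streams do not all share one secret."""
    label = b"stream:" + stream_id.encode("utf-8")
    return hmac.new(master_key, label, hashlib.sha256).digest()


def sign_record(key: bytes, stream_id: str, seq: int, payload: dict) -> dict:
    """Serialize a record and attach an HMAC-SHA256 tag over the serialized body."""
    body = json.dumps(
        {"stream": stream_id, "seq": seq, "payload": payload}, sort_keys=True
    )
    tag = hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def verify_record(key: bytes, expected_stream: str, record: dict) -> bool:
    """Return True only if the tag matches and the record names the expected stream."""
    expected_tag = hmac.new(
        key, record["body"].encode("utf-8"), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking how much of the tag matched.
    if not hmac.compare_digest(expected_tag, record["tag"]):
        return False
    return json.loads(record["body"])["stream"] == expected_stream


# Hypothetical demo values, for illustration only.
master = b"demo-master-secret"
sensor_key = derive_stream_key(master, "sensor-feed")
billing_key = derive_stream_key(master, "billing-feed")

good = sign_record(sensor_key, "sensor-feed", 1, {"temp": 21.5})
rogue = sign_record(billing_key, "billing-feed", 1, {"amount": 10})

print(verify_record(sensor_key, "sensor-feed", good))   # True
print(verify_record(sensor_key, "sensor-feed", rogue))  # False
```

Per-stream keys limit the damage if one key leaks, at the cost of more keys to manage, which is exactly the trade-off raised in the cryptography item above.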
This work is just beginning. I am hoping to learn a lot more when working group members Taka Matsutsuka, Praveen Murthy, and Arnab Roy come on stage to present their current research in this area.
Wish me luck. The session is right after Nicholas Negroponte, and before Alan Kay, so these will be tough presentations to be compared to!
If interested, watch the linked video.