Episode 6: DevSecOps in a Data-Driven World
Sanjeev Sharma: Hello everybody. This is the Data Company Podcast. I'm your host, Sanjeev Sharma. Welcome to a brand new episode. We have a very distinguished guest today. I'm very excited about this one: Hasan Yasar, from the Carnegie Mellon Software Engineering Institute. Now Hasan has two jobs at Carnegie Mellon. He is the technical director of Continuous Deployment of Capability at the Software Engineering Institute and also a member of the faculty. So Hasan, welcome to the podcast.
Hasan Yasar: Thanks for having me join. Thank you so much. I'm glad to be talking with you.
Sanjeev Sharma: Excellent, excellent. I must share with the listeners the context in which you and I met, right? And that was around your work in the DevSecOps movement. You've been involved in the DevSecOps Days series, right? You hosted one in Istanbul and are going to host one again. So can you tell us a little about what you've been doing with DevSecOps Days?
Hasan Yasar: That started almost two years ago. DevSecOps Days started in San Francisco as part of the RSA Conference. Now we're trying to get DevSecOps Days going everywhere, globally. The last one was in Istanbul. This year, 2019, we did more than 10.
Sanjeev Sharma: Wow.
Hasan Yasar: It's growing. Within two years we've gotten so much attention; we've actually surpassed the number of DevOps Days events. It's focused on how to get security practices into the DevOps pipeline, which includes the DevOps pieces, plus other thoughts and ideas related to risk and security, the possibilities of merging risk and security together, and how to change a company's culture so that risk and security are implemented within the right software delivery and deployment pipeline.
Sanjeev Sharma: Excellent. One of the things about security which you mentioned is that security cannot be an afterthought. Security and compliance, right? They have to be in the pipeline. They have to be in the flow. It should not be something you do separately; it should be something you do as part of your daily job as you're building and deploying applications. And I think that's the key message there.
Hasan Yasar: Right. The main message is not really a tooling perspective; it's more about how to build up a team, and how the risk team or the security team will share their knowledge with the architects.
Sanjeev Sharma: Yup.
Hasan Yasar: And there is a big disconnect in how risk management is done at the beginning; some of it sits in silos, away from the dev environment. So developers have no idea what type of mitigations they really have to apply, or what risks are associated with the organization or the software they are building. It is not represented well, or they simply don't know. So two things happen: either they spend their time on problems unrelated to their organization's security posture, spending more time on work whose value they don't understand, or they go in depth on a single component and aren't able to connect the dots with the architects through threat modeling.
Hasan Yasar: So the concept is getting all the dev and security people in the world together. The security teams know their company's posture; they know what attacks they're getting. Getting more knowledge from the security folks and giving it back to developers, that's the main task. And when I see that type of movement, I see a lot of data, which I guess is the topic we're going to talk more about. Data keeps floating around, and security folks have so much of it. They're collecting from IDSes and IPSes, from every type of network monitoring component, and it's siloed with them.
None of it has been shared properly with the dev team: what attack vectors are coming in, from which IP addresses, which port numbers are affected, what was used in the attack. Maybe it's easy for a single application, but when you talk about the system level, it's a huge dataset. And if it's not connected to what the developers are doing with the architecture, then it's very difficult to find the alerts in that dataset that should feed back to the developers. It's like finding a needle in a haystack. It's very challenging to find the right alerts or the right feedback to send back to the developer, because it's a big mess: how it is collected, how it is monitored, how it is stored, and how it is shared. It's a big problem.
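[Editor's note: the needle-in-a-haystack point above can be sketched in code. This is a minimal illustration with invented data shapes, not from the conversation: the security team's raw alert feed is filtered down to only the alerts touching ports a given application actually owns, so they can be routed to that app's developers.]

```python
# Hypothetical IDS/IPS alerts as a security team might collect them.
alerts = [
    {"src_ip": "203.0.113.9", "dst_port": 443, "signature": "SQL injection attempt"},
    {"src_ip": "198.51.100.4", "dst_port": 22, "signature": "SSH brute force"},
    {"src_ip": "203.0.113.77", "dst_port": 8080, "signature": "Path traversal attempt"},
]

# What the dev team knows about their own application: the ports it listens on.
app_ports = {443, 8080}

def alerts_for_app(alerts, app_ports):
    """Return only the alerts that touch ports the application owns."""
    return [a for a in alerts if a["dst_port"] in app_ports]

relevant = alerts_for_app(alerts, app_ports)
for a in relevant:
    print(f'{a["src_ip"]} -> port {a["dst_port"]}: {a["signature"]}')
```

In a real system the filter would of course be richer (service inventories, asset tags, threat-model context), but the shape is the same: connect the security team's data to what the developers are building.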
Sanjeev Sharma: What is this, secure by design?
Hasan Yasar: Secure by design.
Sanjeev Sharma: Yeah, it's become a buzzword, but truly it should be part of your architecture: building the application to operate in a secure manner. You can't just say, "Well, I'm going to secure the perimeter around it and secure the network." The application itself needs to be secured and needs to work in tandem with all the other mechanisms you have to protect your assets.
Hasan Yasar: When you talk about secure by design, there is an element of data: how do you protect the data in that design, and how do you get those mechanisms into the application? When I was teaching a software security class, one of the main problems was that students and engineers don't have the hacker mindset. They put the log file on the same server as the application, which is a security 101 problem, right? If the server is compromised, what will the hackers do? They will take out any log files that have been recorded.
So this type of thing has to be there by design. You cannot change the log file location after the fact. You cannot change the data elements of the components after the fact. That has to be designed in at the beginning. So these thoughts, this mindset, has to be in place, not only for addressing security needs but also behind the scenes in the architecture, in how data is handled. And we're talking about the CIA triad model: confidentiality, integrity, availability. Integrity and confidentiality relate to the data, and how you keep the data confidential. And now we're talking about GDPR, and CCPA is coming in 2020. It's going to be a big mandate on how we handle the data pieces. It doesn't matter what organization you're in; it's going to be a major problem to handle data properly in the application context: how we use it for testing purposes, how we use it for deployment strategies, and how we see it in real time for troubleshooting.
Also, furthermore, how can we use those datasets for training, for AI purposes? There's an ethics piece there: collecting so much private data and now putting it into AI, which is another big buzzword that keeps coming up, AIOps. It's going to be a challenge. So that goes back to why design principles are important at the beginning, because you cannot go and change it later on. Once the data grows, handling the large datasets and copying from one set to another will take time.
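[Editor's note: the log-file point above, that logs must not live only on the server an attacker can compromise, can be sketched as follows. This is a minimal illustration using Python's standard logging library; the collector address is a placeholder you would replace with your real log host.]

```python
import logging
import logging.handlers

logger = logging.getLogger("myapp")
logger.setLevel(logging.INFO)

# Ship log records over the network to a central syslog collector at write
# time, instead of (or in addition to) a local file an attacker could erase.
# 127.0.0.1:514 is a placeholder; point this at your remote log host.
remote = logging.handlers.SysLogHandler(address=("127.0.0.1", 514))
remote.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))
logger.addHandler(remote)

logger.info("user login from 203.0.113.9")
```

This is the "designed in at the beginning" part: once the application only writes locally, retrofitting centralized, tamper-resistant logging after a compromise is too late.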
Sanjeev Sharma: Agreed. So it's the perfect segue, right, into the core of our conversation. From a data perspective, one of the theses of this podcast is that every company is a data company, and eventually, no matter what kind of technology problem you're trying to solve, you will have to figure out how to solve the data-related challenges, whether that's data agility, data availability, data security, or just your culture around data, right? As you've worked with your clients, what would you identify as the biggest data-related challenges they face when it comes to modernizing their applications or adopting DevOps and CI/CD?
Hasan Yasar: So I'm gonna start at the beginning. It's not just every company; actually, every person is becoming a data company (laughing).
Sanjeev Sharma: Every person is a data person. I like that. We should change that buzzword.
Hasan Yasar: That's right. Look at how phone storage keeps growing. We're collecting so many pictures, so many files. We don't care about paper anymore; we're looking for digital notebooks, digital papers, digital books. Everything's digital right now. Think about the same problem at either small scale or large scale. We are always dealing with the problem of handling data: how do we collect it, where are we going to store it, and if we need to find something, how are we going to find that data in big data warehouses, quickly and in a timely way?
In a complex organization it's becoming more of a problem, because storage is one issue: where are you going to store it? And second, how are you going to find the related information in that mass of data, and find it in a timely manner? I'm not talking about later; if I need something, I'm looking for it right now, at the moment, and that's a big challenge. Even on a basic SharePoint site, which businesses have been using for years, finding a single file takes time: where is it located, how is it tagged? That goes back to the architectural components, and it is important how we design those pieces up front. Management, storage, and accessibility are big problems. I have been seeing it, especially accessibility: if you have a very distributed environment, even collecting the data becomes a challenge.
And one more thing I would like to mention, Sanjeev: especially if you are in a highly regulated environment, such as financial systems or health systems, there are different types of compliance and regulations they have to follow, which goes back to accessibility. Does everybody have access to the data or not, and what type of privacy controls have to be in place? So these are other challenges: who is going to access the datasets, who is going to maintain that record of information, who is going to use the information for what purposes, and are there subsets of management and subsets of accessibility? We can talk about it.
Sanjeev Sharma: Excellent. We as an IT industry are great at adopting technologies; we are horrible at retiring technologies. Right? Look at a company which is 60, 70 years old: they have every form of technology, from the mainframe and legacy systems all the way to the latest cloud-native systems running in the cloud. And what we are seeing is that every time a new technology set comes in, another island of data gets created.
And a lot of the large enterprises I interact with personally don't even have good inventories of the data. They know where the data is; they'll say, "I have 2.3 petabytes of data." They know exactly how much, but what's in it? They don't know. They don't have, as you said, a good data model which represents what kind of data is there and what the relationships are between those various islands of data. So as you're seeing this move towards cloud, will we finally see a consolidation, where people bite the bullet and say, "I've got to retire all these old platforms and consolidate my data"?
Hasan Yasar: I'm gonna say what I have been hearing so far. I'm starting to see more of a data engineering concept in organizations. It's not the DB administrator role; it's more that a data engineer will be responsible for the company's data assets and information. Because when you talk about microservices or any type of implementation, there is always a data piece, and what type of data is collected and how it is stored requires higher-level thinking, at both the organizational level and the product level.
So if I'm building an application, it's actually both sides. Whoever builds the application has responsibilities, and the same goes for the consumers. Whoever builds an application has to have some sort of data management policy in the application for how they're collecting data, and likewise consumers need to know how they can access the data.
Data is gold for everybody, for the vendors and for the consumers, and there's also privacy. We cannot share data publicly about everything, and we cannot combine everything in a single platform, so what can we do? That goes back to the secure by design concept; we might say data by design at the beginning. Because we cannot change the data later, how we collect it and how we store it has to be decided in the design phase, when we are building the architectural components, when we are designing the microservices. You can build microservices multiple ways: you can combine your data model within the microservices, and you can build up your own database island as part of each service.
That's happening a lot, because it's much easier for developers to handle that piece within the context of the service. As consumers, they have to specify what type of data is important to them. And the vendor, whoever is designing the software, has to provide mechanisms to collect the data and to share the data with third parties. So what I see is more about understanding the data types, where the data is stored, and how it is shared, and we're going to find multiple ways of sharing data, either through APIs or other gateways.
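[Editor's note: the "database island per service, shared through APIs" idea above can be sketched as follows. All names here are invented for illustration: each service owns its own data store, and other services reach that data only through the owning service's API, never by querying its database directly.]

```python
class OrderService:
    def __init__(self):
        # Private data store: an island owned by this service alone.
        self._orders = {1: {"user_id": 42, "total": 19.99}}

    def get_order(self, order_id):
        """The only sanctioned way for outsiders to read order data."""
        return self._orders.get(order_id)

class BillingService:
    def __init__(self, order_api):
        # Depends on the OrderService API, not on its database.
        self._order_api = order_api

    def invoice(self, order_id):
        order = self._order_api.get_order(order_id)
        if order is None:
            raise KeyError(f"unknown order {order_id}")
        return f"invoice for user {order['user_id']}: ${order['total']}"

orders = OrderService()
billing = BillingService(orders)
print(billing.invoice(1))  # -> invoice for user 42: $19.99
```

In a real deployment the API call would cross a network boundary (HTTP, gRPC, a gateway), but the design point is the same: the data model stays inside the service that owns it.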
Sanjeev Sharma: And it's a big challenge, right? Because most organizations have terabytes and terabytes, or petabytes, of data which has not been tagged. And it is too expensive to go back and tag it, so they tend to paint it with a very broad brush and say, "Everything in this database is at this classification level. Nobody can use it." And then you're not getting the insights, you're not getting the business value which is embedded in that data. It's a huge challenge.
Hasan Yasar: And another challenge, actually, comes when they take those large datasets up a level, for AI purposes: they end up training on the data for the wrong purposes. Since they don't even know what the data is, they train the systems, and the model is not effective enough, because they trained it for the wrong purposes.
Sanjeev Sharma: Understanding the data is very important. You can't just point at it and say, "I'll use this to train it."
Hasan Yasar: Right. It's basic garbage in, garbage out. If you have garbage going into a model, you get garbage coming out of the model, so the model will not work. This is another problem. Collecting big data is okay, but how we're going to use it for AI purposes is another issue. And that goes back to what type of data we are training on and what the purpose of the model we are building is.
Sanjeev Sharma: And you don't know if the data is clean enough, you don't know if it has the right quality, whether it has bias built into it, whether it has wrong data, which will poison your model, you don't know.
Hasan Yasar: And then you label a picture as, say, a cat, but the picture is actually a duck, and the model says the same thing: duck (laughs).
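[Editor's note: the garbage-in-garbage-out point above can be sketched as follows. This is a minimal illustration with hypothetical records: incomplete or mislabeled rows are screened out, and label counts are checked, before the data ever reaches a model.]

```python
# Hypothetical training records; one row has a missing feature.
records = [
    {"text": "charge declined", "label": "fraud"},
    {"text": "normal purchase", "label": "ok"},
    {"text": None, "label": "ok"},  # missing feature: garbage in
    {"text": "refund issued", "label": "ok"},
]

def clean(records):
    """Keep only complete rows with a known label."""
    valid_labels = {"fraud", "ok"}
    return [r for r in records
            if r["text"] is not None and r["label"] in valid_labels]

def label_counts(records):
    """Tally labels so obvious skew or mislabeling can be spotted."""
    counts = {}
    for r in records:
        counts[r["label"]] = counts.get(r["label"], 0) + 1
    return counts

cleaned = clean(records)
print(len(cleaned), label_counts(cleaned))  # -> 3 {'fraud': 1, 'ok': 2}
```

Real pipelines add much more (deduplication, bias audits, label verification), but the principle is the one Hasan states: understand the data before you train on it.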
Sanjeev Sharma: Very true. This has been a very interesting conversation. So now, to wrap it up: for somebody who comes and asks you, in this jungle of options, how do I pick my path?
Hasan Yasar: Yep. It's a great question. Almost every semester I hear the same question from about 30, 35 students, either in my DevOps class or my software security class: what do I need to know, what should I learn? My answer is: don't learn the tools. Tools may change. If you know what a tool's purpose is, then you will learn the tool when the time comes. But what do you have to learn as an engineer, as the young generation? Learn the concept of deployment.
As an example, if, as an engineer, you're looking at a more DevOps-oriented organization, learn about deployment techniques, learn about the infrastructure pieces, learn the theory, understand the concepts first. What is the site reliability engineering concept? What is orchestration? And if the person is looking for more DevOps things, they have to understand automation engineering: how can I automate things? It can be done in the testing phase, it can be done in infrastructure, it can be done with test datasets. So it depends on where they would like to go: automation engineer, software engineer, or security engineer.
They have to learn the theory. They have to understand the principles and components, and then they can work on the implementation later on. If I were going to college, then definitely I would want to be an engineer, a software engineer, because engineering principles do not change. Implementations change. I always suggest that they have to be a good engineer. Whether in software engineering or electronics engineering, whatever career they are in, they have to be a good engineer: understand the concepts, understand the theory. They have to understand how to develop algorithms and what data structures mean.
When they go to a platform, maybe Azure, maybe Google Cloud, maybe AWS, they will learn the platform when the time comes. The fundamentals are not taught on YouTube or at conferences; they have to learn them by studying the specific subjects.
Sanjeev Sharma: We use the term architecture a lot in software engineering. Does Carnegie Mellon, or any place you know, offer mechanisms for people to go study the architectures of systems, of large application systems?
Hasan Yasar: There are. CMU is offering courses. CMU has the Institute for Software Research as well, which is heavily oriented towards building architecture, complex architected systems at the system level, using the cloud as an example, or using different environments. Actually, you made a perfect point. When I was in Istanbul with a bunch of security folks who came to DevSecOps Days, they looked at the buildings and said, "Wow, architects built this building, and it's still standing after more than 2,000 years."
In this day and age, we are building software and the architecture is bad. Even the simple modules we are building are not scalable, not reliable. It bothers me. That goes back to engineering principles. We are building things very fast, very quickly, and I'm not against fast and quick, but what I'm saying is we should think about more resiliency in the code itself. We should think more broadly, about the longer-term need for any component.
Sanjeev Sharma: Absolutely. We have to keep in mind that technologies will change over time, but can your architecture be resilient enough to handle the technology?
Hasan Yasar: That's right. I was at one talk yesterday where a person was asking about COBOL with DevOps. The speaker said, "If you're able to get that COBOL architecture into modular, incremental pieces, it doesn't matter what platform you're using." Can you do that? Yes? Okay, that's fine. Then that means your architecture is resilient, repeatable, and incremental enough, and you can use it. That goes back to our data discussion again: it is important to build things right at the beginning, architecturally and system-wise.
Sanjeev Sharma: Absolutely. That's a perfect place to end this podcast. This has been a great discussion. Thank you so much, Hasan, for joining. This was a great conversation, and I'm sure all the listeners will enjoy it. This is your host, Sanjeev Sharma, for the Data Company Podcast, signing off.