Episode 3: Building Software with a Modern Data Strategy

In our third episode, host Sanjeev Sharma sits down with IBM Distinguished Engineer Andrea Crawford to talk about making data a core competency, modern software delivery techniques, and why data science is a hot space and career path to be in today.

Sanjeev Sharma: Hello everybody, and welcome to the new episode of the Data Company Podcast. I'm here with Andrea Crawford from IBM, and Andrea is an ex-colleague of mine from my time back at IBM, and a Distinguished Engineer over there. But I'll hand it over to Andrea. Andrea, why don't you introduce yourself.

Andrea Crawford: Sure, thanks Sanjeev. My name is Andrea Crawford, and I am a Distinguished Engineer with IBM. My focus area is DevOps and in particular DevOps in the context of Garage solution engineering. IBM has this methodology around modern delivery on modern platforms with modern teams, and we call it IBM Garage. One of the fundamental tenets of Garage is DevOps, so that's sort of my specialty area.

Sanjeev Sharma: Excellent. As this podcast is called the Data Company Podcast, the thesis under which we started this podcast is that every company is a data company. There's not a company in the world which doesn't have data. What has changed in the recent past is what they are able to do with the data, the kind of insights and inferences they can get. As you are helping customers adopt DevOps and, more importantly, adopt the Garage method, and as applications are being modernized or brand-new cloud-native applications are being developed, what do you see when it comes to the challenges clients face around data?

Andrea Crawford: I talk a lot about application modernization and how to infuse more modern delivery techniques in terms of how software is delivered. But a conversation that also has to be had is what is the modern data strategy that these clients are going to take going forward? In the context of more conventional, traditional ways of dealing with application data, we see the typical relational database management systems. We all know the products, the Oracles, the DB2s of the world, are pretty much a mainstay with a conventional data product set, right? And therefore, as we start to change the conversation to more of microservices, REST APIs, and architectural styles that really unleash the power of cloud and containers, we're actually seeing the world of data open up a bit in terms of going beyond relational database management systems.

There was this whole movement, you remember, a while back around big data, right? Back then, it was all about structured and unstructured data, which is opening up the product set in terms of where data is going to be found. It may not necessarily all be found in relational database management products. Then there is this notion of unstructured data, this whole NoSQL movement, geospatial data and the like, and even being able to identify situations where perhaps data shouldn't be in a relational database product. Can we leverage things like database-as-a-service and caching services in cases where data needs to be accessed with different service-level qualities? There is also combining that with the microservices movement and being able to vertically slice a lot of these monolithic applications so that we're not just creating microservices around a monolithic database. Because you have to break down the monolithic data underneath it, or else you won't achieve the kind of success that our clients are looking for.

Sanjeev Sharma: Absolutely. You might have microservices which are now so entangled by the singular monolith that it becomes like an old inflection point which everybody has to go through. But one of the challenges, and I'd like to get your input on this, which I see when I'm talking to customers, is they really struggle to even classify data. When you talked about having different service levels, what's the prerequisite for that? I need the ability to first inventory all my data. I need the ability to segment all my data based on the classification level to determine what quality of service that data needs, and that classification could be based on how sensitive that data is. Do I need to mask it before I make it available? Is this just telemetry data from an application or an IoT device? Or does it have PII and PCI data in it?
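To make the classification idea concrete, here is a minimal Python sketch of mapping a dataset to a sensitivity class and then to handling rules. Everything in it is an assumption for illustration: the column names, the three classes, and the policy table are hypothetical and were not described in the conversation, and a real inventory would rely on far more than column names.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Sensitivity(Enum):
    TELEMETRY = "telemetry"  # application or IoT device metrics
    PII = "pii"              # personally identifiable information
    PCI = "pci"              # payment card data


@dataclass(frozen=True)
class Handling:
    mask_before_sharing: bool
    cacheable: bool


# Hypothetical policy table: one handling rule per sensitivity class.
POLICY = {
    Sensitivity.TELEMETRY: Handling(mask_before_sharing=False, cacheable=True),
    Sensitivity.PII: Handling(mask_before_sharing=True, cacheable=False),
    Sensitivity.PCI: Handling(mask_before_sharing=True, cacheable=False),
}

# Hypothetical column names used as classification signals.
PCI_COLUMNS = {"card_number", "cvv"}
PII_COLUMNS = {"name", "email", "ssn"}


def classify(columns: set[str]) -> Sensitivity:
    """Return the most restrictive class implied by a dataset's column names."""
    if columns & PCI_COLUMNS:
        return Sensitivity.PCI
    if columns & PII_COLUMNS:
        return Sensitivity.PII
    return Sensitivity.TELEMETRY


def handling_for(columns: set[str]) -> Handling:
    """Look up the handling rule for a dataset described by its columns."""
    return POLICY[classify(columns)]
```

Under these assumptions, a dataset with only `device_id` and `temperature` columns would be treated as telemetry and could be cached, while one containing `email` would need masking before it is shared.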

Andrea Crawford: I would totally agree with that, Sanjeev. In addition to all of those things, especially around regulatory compliance, we also have to talk about the behavioral characteristics of the consumption of the data. Back in our core enterprise systems, you tend to have a lot of requirements around consistency, transactional integrity, and the like. That tends to be a very core set of characteristics.

However, when you start to get into, for example, mobile apps, and you start to get into data elements that don't necessarily need to have the type of transactional integrity on the edge, then you can start to open up your options around caching services that don't necessarily have to have the same rigorous or stringent transactional integrity that your core systems would have. Therefore, you can open up the possibilities of introducing other types of data services that will meet the level of service that you want, get the performance that you need, without compromising the integrity of the user experience itself.

Sanjeev Sharma: Excellent. I think that's the challenge. Most organizations still look at data with a very broad brush, right? It's there, it's in my system of record, it's in a database. I was recently talking to a bank, and they're like, "Forty-three percent (or some number they came up with) of our data is actually still on a mainframe, but more than 95% of applications touch it directly or indirectly." But that data is being painted with a very broad brush. It's all or nothing. There is no concept that some parts of the data are different: that I can cache this part because I need it for performance, or that I can have a different quality of service level for that particular data.

One of the blockers, and you and I were joking beforehand about this whole term DevSecOps, which is forcing data to be looked at with such a monochromatic view, is security and compliance. Everybody's looking at it saying, "Oh no, these are our crown jewels. Our data is there, we need to protect it." There are some very stringent, old-school rules and compliance or audit requirements around the data. In the DevSecOps world, the whole idea is you bring security and compliance into the process. What are you seeing in that space? What's the evolution you are seeing, or what are the improvements you are seeing?

Andrea Crawford: It's an interesting problem to me. On one hand, you have the whole area of secure design and development. So are you actually designing and coding applications that have security built in? That's one area. Another area is security in the context of how your code is being delivered, and is it following a set promotion path with the right kind of entry and exit criteria? Are we auditing all of the significant events along the line, so that auditors can ask things like, "Is this an immutable set of deployment packages, and are you sure you're not changing something right at the very end?"

But when it comes to data, this is really interesting too. There's this whole segment around data in terms of data that you're collecting in terms of how things are being designed and also delivered. So being able to do continuous improvement to make sure that you're getting better, and that you can prove that you're doing all this scanning, testing, vulnerability checks, and all that stuff along the way. When it comes to application data, you're going to find all the things around your identity and access management, your regulatory compliance, and qualifying what is SPI, what data needs to be under certain scrutiny based on GDPR or HIPAA or FISMA for federal.

There has to be a recognition that you've got regulatory compliance that has to be considered, identity and access management. Then, it's sort of interesting because we're getting into cognitive, AI, and how we can use some of this set of core data to gain actionable insights in terms of data trends. There may be some cases where you have to obfuscate or mask data, so that you're not in violation of any of these regulatory compliances and the like. But you also need to have test data sets that are available for integration testing, performance testing, and end-to-end function testing as well.

So how do you get that? They would ask, "What is the recommended way to get a solid test data management program in place? Today, we copied production data." Whereas the opposite end of the spectrum is you want to be able to manufacture a subset of data that you will need for the kind of testing that you're trying to do. There are pros and cons, and variations in between, to all of those things. If you're duplicating production data, you could be duplicating the risk and exposure of your production data set.

Sanjeev Sharma: Absolutely. If you are copying production, you need to make sure you mask and obfuscate it. And if you're using synthetic data, maybe it is to test outlier scenarios which do not exist in production, or shouldn't exist in production since they're outliers, or to test new features. Like, you're just adding geolocation to your mobile app. Well, there's no geolocation in your production data because you didn't have it in your mobile app. You have to use some kind of synthetic data to be able to test a new feature. So, absolutely. It's a broad spectrum.
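The two ends of that spectrum can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`name`, `email`, `latitude`, `longitude`), the fixed salt, and the helper names are hypothetical, and salted hashing is a simple pseudonymization technique rather than a substitute for a full test data management or compliance program.

```python
import hashlib
import random

# Assumed names of the fields that must not reach a test environment in clear text.
SENSITIVE_FIELDS = {"name", "email"}


def mask_value(value: str, salt: str = "test-env") -> str:
    """Replace a value with a deterministic token so joins across tables still line up."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "masked_" + digest[:12]


def mask_record(record: dict) -> dict:
    """Return a copy of a production-like record with sensitive fields masked."""
    return {
        key: mask_value(str(value)) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }


def with_synthetic_location(record: dict, rng: random.Random) -> dict:
    """Add made-up coordinates for a feature that has no production data yet."""
    enriched = dict(record)
    enriched["latitude"] = round(rng.uniform(-90.0, 90.0), 6)
    enriched["longitude"] = round(rng.uniform(-180.0, 180.0), 6)
    return enriched


if __name__ == "__main__":
    row = {"id": 7, "name": "Pat Example", "email": "pat@example.com", "plan": "gold"}
    rng = random.Random(42)  # seeded so the synthetic test data is reproducible
    print(with_synthetic_location(mask_record(row), rng))
```

Masking keeps the shape of real data while reducing the exposure of copying it, and the synthetic step covers the geolocation case, where production has nothing to copy.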

I'm glad you mentioned AI and machine learning because that, to me, is an emerging use case in the DevOps world. As you go from a hundred delivery pipelines to now microservices, going to thousands of delivery pipelines, you have to introduce a lot of automation and a lot of machine learning to be able to digest all the info, even if you're not acting on it. To just digest all the data which is getting created by these delivery pipelines. How are you seeing that evolving? Do you see AI and machine learning getting adopted as a part of your application delivery pipeline?

Andrea Crawford: That's the trend. I mean, I don't know if you've heard of AIOps. We're starting to see all these permutations of DevOps. We just talked about DevSecOps a little bit, but there's also like GitOps and AIOps and BizDevOps and ChatOps. All of these different permutations of the DevOps domain, if you will. I think that from what I'm seeing with clients, they have pieces of the telemetry story. They may have the significant events along the ancestry or lifeline of a feature, but they're interspersed between all the different heterogeneous tools that you would see in your tool chain.

I think what we're going to see is more tooling in the space around end-to-end software delivery life cycle management and being able to connect some of these tools in a way where we can see the story, the lifeline of how a feature came from ideation all the way through to production and the management of that feature and its operational runtime environment. Once we have the data, the significant events: this feature was committed to this branch, then there was a pull request and then a merge, and then it got tested and built and scanned and all the way down. Once we're able to gather these significant events, then we can start to use things like artificial intelligence to really understand the value stream of how efficiently a feature got from ideation all the way through to production, deployment, and beyond.

Once we see the data, then we'll be able to draw insights and trends from some of the patterns, such as the long pole in the tent. We'll be able to see it from empirical data that was gathered, and then we'll be able to draw insights like, "This code was submitted as a pull request, and it sat there for five days." This is indicative of perhaps a behavioral change that needs to be made within the squad or the development team itself. We need to be integrating, hopefully on a daily basis, the code into branches and whatnot.
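The pull request example can be sketched with plain Python once the significant events have been gathered. The event shape used here (dictionaries with `pr`, `type`, and `at` keys, and the `pr_opened` and `pr_merged` type names) is an assumption for illustration and does not correspond to any particular tool's export format; the one-day threshold is likewise arbitrary.

```python
from datetime import datetime, timedelta


def pull_request_wait_times(events):
    """Map each pull request id to the time between its open and merge events."""
    opened = {}
    waits = {}
    for event in sorted(events, key=lambda e: e["at"]):
        if event["type"] == "pr_opened":
            opened[event["pr"]] = event["at"]
        elif event["type"] == "pr_merged" and event["pr"] in opened:
            waits[event["pr"]] = event["at"] - opened[event["pr"]]
    return waits


def slow_pull_requests(events, threshold=timedelta(days=1)):
    """Return ids of pull requests that waited longer than the threshold, sorted."""
    waits = pull_request_wait_times(events)
    return sorted(pr for pr, wait in waits.items() if wait > threshold)


if __name__ == "__main__":
    events = [
        {"pr": 1, "type": "pr_opened", "at": datetime(2020, 1, 1, 9, 0)},
        {"pr": 2, "type": "pr_opened", "at": datetime(2020, 1, 2, 9, 0)},
        {"pr": 2, "type": "pr_merged", "at": datetime(2020, 1, 2, 15, 0)},
        {"pr": 1, "type": "pr_merged", "at": datetime(2020, 1, 6, 9, 0)},
    ]
    print(slow_pull_requests(events))  # pull request 1 waited five days
```

A simple threshold like this is only a starting point; the harder part described in the conversation is collecting consistent events from heterogeneous tools in the first place.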

Sanjeev Sharma: Right. What happened here? Why is this an outlier scenario? But today you won't spot it.

Andrea Crawford: But today you won't spot it. It's very difficult for clients to spot it because they are only now understanding the value of getting empirical data on how they're delivering software, and they're having to retrofit it in situ. They're having to go back and try to pipe into their tool chain to get the data.

Sanjeev Sharma: Right. One of the challenges there is that, first of all, there's a lack of standards on how to share that data and how to normalize it so that it is consumable. And secondly, even from a basic agile practices perspective, there seem to be still a lot of religious wars going on as to what is the right practice. The Garage method has its own approach to it, but I think all those will need to normalize a little bit more before we are really able to understand what this data means. Even then, it'll be very different, not just from company to company, but probably from project to project within an organization.

Andrea Crawford: That's right. It really makes the whole notion of observability across the portfolio that much more challenging.

Sanjeev Sharma: Now, as we wrap things up, there's been a lot of focus on women in STEM. If somebody is new in the industry, or if somebody is just looking to change the path they're on, and answering through the lens of a woman engineer, what advice would you give somebody to help figure out their career path?

Andrea Crawford: One of the things that I've actually carried through from my college days, eons ago, was that I learned how to learn. I have always had this technical curiosity that was very much nurtured by my parents, even when I was a young lass. I gotta tell you, Sanjeev, the whole career path around data science is hot. It is all about problem solving and using data as your natural resource in order to gain those types of insights that you may not be able to really understand if you were just looking at numbers.

I would highly encourage those who have a technical curiosity and a passion for data to really get into it. Now, if you want to talk about specifics, like where do I start, there's a lot of really interesting work happening with analytics. That's a really cool and interesting place to be, so the whole observability problem set and being able to understand, through data, how you're delivering software, and ultimately how it behaves in the operational environment. It's a hot space to be in, but I think with any of those areas that one would choose, it's hard to go wrong.

Sanjeev Sharma: Excellent. Excellent. Thank you. I'm sure people will really benefit from that advice, and it was a pleasure having you. This was a great conversation. Thank you for your time.