This is a transcript of episode 179 of the Troubleshooting Agile podcast with Jeffrey Fredrick and Douglas Squirrel.
Squirrel tweets about isolated data-science teams, and Jeffrey tells stories of close collaboration between engineers and data scientists, even pairing!
Show links:
Listen to the episode on SoundCloud or Apple Podcasts
Introduction
Listen to this section at 00:14
Squirrel: Welcome back to Troubleshooting Agile. Hi there, Jeffrey.
Jeffrey: Hi Squirrel. I was on Twitter just before we started recording and I saw several tweets from you today, but you’re not normally a multi-tweet kind of person.
Squirrel: No, that’s true.
Jeffrey: What’s going on?
Squirrel: Well, what I discovered is that instead of just writing down five or six words to remind myself about topics we might discuss in the future, I could tweet them! So, I started doing that. So when we’re looking for a topic, I can just look at my tweet stream and other people’s comments there, which seems to be working great. I only started it today, but if you want to check out what might be coming on the podcast, you could look at @DouglasSquirrel.
Jeffrey: We’ll link in the show notes to one of the ones that you tweeted today, because I saw it and said, ‘why don’t we talk about DataSciOps?’
Squirrel: Sure! It’s been brought to mind a couple of times for clients recently, but for several years I’ve seen that data science teams are presently like server teams, operations teams, and system admin teams used to be around 2010. Back then your operations team was this separate entity and it existed somewhere in your pipeline. In the case of system admins, they were at the end of the pipeline. You’d written a whole bunch of code and then these mysterious people in a different building, who you’d never seen or heard of before, did magic that you didn’t understand, and your code went live. That seemed insufficient to very smart people like Patrick Debois, and the folks at Flickr, and John Allspaw, and other folks. And they said, ‘hey, what if we had this thing where the developers and the operations were together and gosh, we could call it DevOps!’ That term has been bastardised in quite a lot of ways since then. Now you talk about hiring ‘a DevOps’ and I never understand what that means. But the original idea was what if these people got out of their separate buildings and stopped treating each other as magicians? That worked out pretty well for the teams that tried it in its original form. I’m not so sure about the modern mutation. But my thought was that data scientists are very similar. They are magicians at the beginning of the pipeline. Typically, they’re coming up with something abstract and new and exciting that’s going to go into a software system of some variety, but they’re coming up with something that many other people in the team don’t understand. It comes sailing in from outer space and lands on your desk and you say, ‘what do I do with this?’ In a similar way, they’re not part of the regular process of the team. They’re not part of the standups or the sprint planning or the retrospectives. I keep seeing this go wrong, and it sure seemed to me like what we were seeing go wrong in the operations world in the early part of the century.
Integrating Data Science Teams
Listen to this section at 04:05
Jeffrey: Your story here made me think of not just sysadmins, but Oracle DBAs and other people with special knowledge who are also key. It occurred to me people want to be informed and data-driven these days, yet their approach is to have these silos. They don’t really have data science embedded in their process, and they’re often lacking the data literacy to be able to actually have a constructive conversation with people.
Squirrel: That was true of the server administrators in the early 2010s. I remember teams where continuous integration, which was new then, was being run by the operations team as some magic thing. You just handed over your code and some tests and you got a test report back. What happened in between? No idea. But many of us learnt to operate those systems ourselves and to run the release process, how to push a button and release instantly. All those things came to be less magic and I suspect we can do the same with data science.
Jeffrey: I’ve had an interesting experience over the past year or so with a product team where we did have data science people integrated as early as trying to form a hypothesis for the product. We were generating new capabilities, and the belief we had was that some people would be using the existing product to meet the needs the new capability would address. The question was, could we discover that from our data? We had one of our data scientists go and cluster the users based on their usage of the products, starting with just the idea that if you looked at how different users use the product, could you cluster them meaningfully? Would you effectively derive personas from the data? We were able to apply labels to types of behavior and say ‘we have these four usage patterns of the product, and we think that what we are developing matches the behavior from two of them.’
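To make the clustering idea concrete, here is a minimal sketch of grouping users by how they use a product. It is not the analysis Jeffrey's team ran: the user names, the two usage features, the choice of plain k-means, and the two starting centroids are all assumptions made for illustration, and the example finds two groups rather than the four usage patterns mentioned above.

```python
import math

# Hypothetical per-user usage counts: (reports_run, dashboards_viewed).
usage = {
    "user_a": (20, 1),
    "user_b": (18, 2),
    "user_c": (1, 15),
    "user_d": (2, 17),
}

def nearest_index(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest_index(p, centroids)].append(p)
        centroids = [
            # Keep the old centroid if a cluster ends up empty.
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

points = list(usage.values())
# Starting centroids are picked by hand here so the run is deterministic.
centroids = kmeans(points, centroids=[points[0], points[2]])
labels = {user: nearest_index(p, centroids) for user, p in usage.items()}
```

With this toy data, `labels` puts the report-heavy users in one group and the dashboard-heavy users in the other. A person still has to look at each group and give it a name, which is the step of applying labels to types of behavior.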
Squirrel: That sounds really useful for making sure that product managers and others making product decisions have useful data to start with. I’ve seen that when you get data scientists involved in the day-to-day work, teams are much better driven by product priorities and by customer understanding. The opposite of that is what triggered the tweet: observing a couple of teams where their projects were dreamed up by data scientists, applauded by other data scientists, demonstrated to data scientists, and then plopped onto the desk of a developer or a product manager who says, ‘who ordered that?’ Those projects do not then become part of the ultimate product.
The Need for Common Language and Understanding
Listen to this section at 08:29
Squirrel: I had one situation where the data science team was spending months and months coming up with really innovative stuff. And the data scientists would say, ‘look at this, it’ll solve this problem for customers!’ But the developers, because it was magic to them, would get these Jupyter notebooks handed over to them and say, ‘I can’t even tell how to run it. I don’t even know what I’m supposed to do with the lines and lines of text that have come to me. How do I check this in? Where are the unit tests?’ They were speaking completely different languages. No surprise, those data scientists were very frustrated because there was no route to production for them.
Jeffrey: You can also see this kind of thing where the data scientists are undermined because the data that they were given in the beginning wasn’t going to match what people would have in production. So all the work they did was invalid because it had lookahead bias, because the people provisioning the data weren’t having conversations with the data scientists about what needed to be true about the data. These miscommunications can torpedo very promising things early on, or later when it comes time for integration. But essentially, we’re not getting value because people aren’t collaborating. When you posted this on Twitter and you said, ‘hey, we need a Patrick Debois for the data world, how about DataSciOps?’ Someone said, ‘oh, no, we’ve got that. There’s this DataOps thing.’ I looked through that and it didn’t resonate with me in the way that your tweet did. I think because it focused a lot on the process and the tooling. It seemed to be saying, ‘we’re going to have better data science.’ That doesn’t have the same focus on alignment and embedded collaboration that I took from what you were saying, like DevOps: make them one team, have them there together from the beginning. I didn’t see that in that DataOps thing.
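The lookahead bias Jeffrey mentions can be shown with a small sketch. The readings and function names below are hypothetical, not taken from any system discussed in the episode. The first feature is computed from the whole series, so early rows "see" later values. The second uses only strictly earlier rows, which is all a production system would have at that moment.

```python
from datetime import date

# Hypothetical daily readings, already sorted by day. Values are illustrative only.
readings = [
    (date(2021, 1, 1), 10.0),
    (date(2021, 1, 2), 12.0),
    (date(2021, 1, 3), 11.0),
    (date(2021, 1, 4), 30.0),
]

def mean_with_lookahead(rows):
    """Leaky feature: every row gets the mean of the whole series, future days included."""
    overall = sum(value for _, value in rows) / len(rows)
    return [overall for _ in rows]

def mean_point_in_time(rows):
    """Production-like feature: each row gets the mean of strictly earlier days only."""
    features = []
    total = 0.0
    for i, (_, value) in enumerate(rows):
        # No history exists on the first day, so there is no feature value yet.
        features.append(total / i if i else None)
        total += value
    return features

leaky = mean_with_lookahead(readings)
honest = mean_point_in_time(readings)
```

Here the spike on the last day pulls the leaky feature up to 15.75 for every row, including the first three days, when nobody could have known about it. The point-in-time version gives `None, 10.0, 11.0, 11.0`. A model trained on the leaky column would be validated against information it will never have in production.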
Squirrel: That’s the bastardisation of the original DevOps mission, where people conflate it with tools like Terraform and Kubernetes and containers and all these nifty techniques that make things simpler. That’s what this DataOps stuff seems to be about. Listeners are welcome to correct us, if you know more about it, please tell us, but it seems to be talking about the tools and the methods. I’m in favour of automating lots of things, but DevOps is not about automation, it is about collaboration and demystifying what people are doing and getting true customer-led innovation going. In that case it was with system admin teams. Well, why can’t we do the same for data science? It’s not about the tools. It’s not ‘let’s automate and make it simpler for data scientists to do their job.’ That’s a separate, great thing to do. It’s not what I’m complaining about in the tweet.
The Art of Pairing
Listen to this section at 11:53
Jeffrey: I’m really excited about the idea of developers and data scientists working together. I think there’s a lot that they can learn from each other. There is huge potential in bringing these two things together and I think we can get amazing outcomes. At TIM for several years we had data scientists and developers pair rather than working as silos, and I’ve seen incredible collaboration from that.
Squirrel: Some of our listeners, sadly, may not actually know very much about pairing. We should probably do a whole episode on pairing sometime because people have forgotten about it. But what you mean is a data scientist and a developer sitting at the same computer with one keyboard and producing an output. I hesitate to say code because some of it might be a model, but they’re producing something that makes a computer do something amazing and they’re two people at one keyboard.
Jeffrey: I have a particular pair of people in mind and they did sit next to each other. Now they are still pairing, though remotely, and yes, they create output and they’re partners in their collaboration. They make sure that they have synergy: the code produced by the developer is informed by the needs of the data scientist, and the data scientist’s code interacts cleanly with the developer’s. As a result, they’re able to make much more progress much faster than they would working separately. That’s what they tell me. ‘Look, we’re so much more productive getting this kind of system built than when the modelling was done separately from the development.’ If we roll back in time to a few years ago at TIM when we had our first model come in, it had to be completely reimplemented. We got a bunch of code that was not what we had in production, and then we had to go build a bunch of tools around collecting data, and reimplement a lot of validation to make sure that we’re getting consistent results. There’s a lot of backfill, not value-add work. Once we learnt how to do things better around generating the data realistically from production and essentially partially building the production system first, because you’re building the data pipeline that you’re actually going to use in production from the beginning, that was a huge win. It made the data scientist’s life easier because they could get more data, better, faster. What came out of it were models that were already built on top of that production data. It was an amazing improvement for us when we got that collaboration in place.
Squirrel: Well, we like to give listeners ideas that they can implement. Pairing may be a kind of radical, out-of-fashion idea these days, but both Jeffrey and I are big fans. But in this specific case, the idea is greater collaboration directly between data scientists and engineers, whether or not you go as far as pairing two people on one keyboard.
Jeffrey: If you’ve tried it and it worked, I’d love to hear from you. And if it didn’t work, why not? I mean, I would be fascinated, because I’m sure that we’re not the first people who’ve tried this or thought of it.
Squirrel: Thanks, Jeffrey.
Jeffrey: Thanks, Squirrel.