In this episode of the GovFuture podcast we get the opportunity to talk to Stacey Marovich and Jennifer Cornell, who are with CDC’s National Institute for Occupational Safety and Health (NIOSH). Stacey and Jennifer demoed at our GovFuture Forum May event. In this podcast they get the opportunity to dig a bit deeper into the NIOSH Industry and Occupation Computerized Coding System (NIOCCS) that they demoed at the GovFuture Forum. They also discuss how public health departments are making advancements with transformative technologies, specifically how they are adopting and using NIOCCS, as well as transparency versus privacy when it comes to data, and specifically PII data.
If you enjoy listening to this podcast, please rate us on Apple Podcasts, Google, Spotify, or your favorite podcast platform. Also, if you’re not already a member, consider becoming a GovFuture member to take advantage of all the community has to offer, including access to a diverse network of government innovators, opportunities to collaborate with government agencies, exclusive access to events and resources, and a platform to have a voice in shaping the future of government innovation. To sign up go to www.govfuture.com/join/.
Trimmed Episode Transcript: (note there may be transcription errors or mis-attributions, so please consult the audio file for any potential errors)
[Kathleen Walch] For today’s podcast, we’re so excited to have with us Stacey Marovich and Jennifer Cornell, who are with CDC’s National Institute for Occupational Safety and Health, NIOSH. So welcome Stacey and Jennifer. We’re so excited to have you on our podcast today.
[Stacey Marovich] Thanks so much, Kathleen.
[Jennifer Cornell] Thank you. Yeah, we’re excited to be here. Thanks for hosting us.
[Kathleen Walch] Yeah. And for anybody that was able to make our May 2023 GovFuture Forum event, you may have seen Stacey and Jennifer there. So we will have them do a little bit of a recap, and that also is available on our site, which we will link to in the show notes as well, so you can check out what they demoed there, and Stacey was on the panel. But we’d like to start by having you introduce yourself today to our listeners and tell them a little bit about your background and your current role at the CDC. Stacey, we’ll start with you, and then Jennifer, we’ll follow up with you. Great.
[Stacey Marovich] Thank you. Yeah, so my name is Stacey Marovich. I’m a lead health informatics scientist with the CDC NIOSH, National Institute for Occupational Safety and Health. I’m the IT team lead in the Health Informatics branch within the Division of Field Studies and Engineering. And I’m also the project officer for the NIOCCS project, which is our NIOSH industry and occupation computerized coding system, which is what we’re going to be talking about today. I’ve been with NIOSH for a little over 20 years now.
And I kind of came from the IT side and kind of fell into the health informatics part of it. So with that, I’m joined by my colleague Jennifer Cornell, so I’ll turn it over to you, Jennifer, to give a quick intro of yourself.
[Jennifer Cornell] Thank you Stacey. My name is Jennifer Cornell. I’m a technical information specialist with NIOSH. I’m also a part of the IT team and NIOCCS project in the Health Informatics Branch within the Division of Field Studies and Engineering. I oversee the workflow for existing NIOCCS projects and our professional industry and occupation coding team. I’ve been with NIOSH about three and a half years.
I have a legal background, but it’s really exciting to be in public health and part of all this innovative work. So thank you.
[Ron Schmelzer] Fantastic. Well, one of the great things that we saw demoed at the May GovFuture Forum that Kathleen talked about was a really interesting auto coding system. It’s interesting because this idea of auto coding comes up quite a bit. So at the May 2023 GovFuture Forum, you demoed the NIOSH Industry and Occupation Computerized Coding System, NIOCCS. So maybe you could provide a high level explanation of what this is.
And of course, with auto coding comes this idea of artificial intelligence and machine learning. So this is the exciting part. I know one of you wants to get started here and talk a little bit about what you demoed and maybe give a little background and maybe even what was the motivation for the project and where you got started.
[Stacey Marovich] Sure. Yeah. So I’ll take that one. So yeah, our NIOCCS tool, we call it NIOCCS for short. So it was developed by CDC NIOSH. It’s been around for a little over 10 years now.
And we’re currently on our fourth major version of the system. So what the system does is it translates industry and occupation text data into standardized codes so that they can be used for research and analysis purposes. And it’s free and it’s publicly available on the web so anyone can use it.
And our latest version also uses machine learning, which I’ll explain further. But essentially NIOCCS provides an easy way to modernize information systems to enable rapid assessment of how people’s jobs impact their health and safety by making coding of work related data simple, fast and consistent. So for example, during the COVID pandemic, NIOCCS was used to identify and study workplace outbreaks that I’m sure you recall hearing about in the news, like at meatpacking plants, healthcare facilities, schools, etc.
So the NIOCCS platform itself has multiple features. So we have features for coding batches of multiple records after data collection has occurred. So think of like survey data or vital statistics. We also have capabilities for coding single records in real time as the data is being collected. And we actually have a NIOCCS web API that can help facilitate that. So as I mentioned, in our latest version we’re using machine learning to auto code industry and occupation free text into codes. So the primary building blocks of the NIOCCS auto coder are word vectors and neural networks. So just to kind of explain those quickly, word vectors are used to encode text for use in machine learning algorithms by translating free text narratives into numerical values. So in this way, word vectors are an attempt to mathematically represent the meaning of words by clustering words as points in a multidimensional space arranged in accordance with the way their associated words are used. So points representing related words are spatially closer together. So an industry and occupation example of this would be, you would expect to see word vectors for words like waiter and waitress close to restaurant.
You would expect teacher and school to be closer together, but more unrelated words like construction and waitress, you’d expect to see further apart as word vectors. So as far as how the NIOCCS auto coder works at a high level, you input the industry and occupation free text, and then we do some pre-processing to clean the input data by removing, you know, some noise, non-content words like “the, my, a,” and we also do some correcting of some of the spellings. And then the industry and occupation input texts are each separately matched to a dictionary of known words, and then converted to an industry phrase vector and an occupation phrase vector, respectively. And just note that a phrase vector is really just the same concept as a word vector, only applied at a slightly higher level to account for the multiple words that we may see in the industry or occupation text. So then these I&O phrase vectors are fed into two neural network models, one for industry and one for occupation, which have been trained to classify industry and occupation categories. And note that both of the I&O phrase vectors are input separately into each of these neural networks, so into the industry as well as the occupation neural network. And this is because the content of the industry phrase may impact the coding of the occupation phrase and vice versa. So an example would be something like if you had an occupation of secretary, the coding of secretary as an occupation will differ depending on, you know, what the industry phrase is.
So if it’s a law firm versus a hospital. So then the NIOCCS auto coder outputs the predicted probabilities for all industry codes and all occupation codes for these text inputs. And the auto coder returns the highest-probability industry code and the highest-probability occupation code as outputs. So we’ve seen many benefits from applying machine learning methods to industry and occupation coding. First, coding speed has increased significantly. We can now auto code thousands, tens of thousands of records in just a matter of minutes. The system also requires less manual maintenance and is more adaptable to changing trends in the job market, to be able to recognize newer terms that pop up, you know, things like DoorDash, Uber, Lyft. And this technique also results in improved coding uniformity by reducing the human factor that can lead to contradictory selections. And all of these benefits increase CDC’s and others’ ability to code more data for occupational safety and health research. Thanks.
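The flow Stacey describes (clean the text, build phrase vectors, feed both vectors to a classifier, return the most probable category) can be sketched in a few lines of Python. This is a toy illustration and not the NIOCCS implementation: the three-dimensional vectors, the stop-word and spelling lists, the category labels, and the hand-set weights are all invented for the example, and a single linear layer with a softmax stands in for the trained neural networks. Only the occupation side is shown; an industry model would take the same joint input.

```python
import math
import re

# Toy 3-dimensional word vectors, invented for illustration only.
# The axes loosely mean (food service, education, construction).
WORD_VECTORS = {
    "waiter":       (0.9, 0.1, 0.0),
    "waitress":     (0.9, 0.0, 0.1),
    "restaurant":   (0.8, 0.1, 0.1),
    "teacher":      (0.1, 0.9, 0.0),
    "school":       (0.0, 0.9, 0.1),
    "construction": (0.0, 0.1, 0.9),
}
STOP_WORDS = {"the", "my", "a"}
SPELLING_FIXES = {"resturant": "restaurant", "waitres": "waitress"}  # hypothetical

# Hypothetical occupation categories (not real occupation codes) and
# hand-set weights: one row per category, six inputs (3 industry + 3 occupation).
OCCUPATION_LABELS = ["food service", "education", "construction"]
OCCUPATION_WEIGHTS = [
    [4.0, 0.0, 0.0, 4.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0, 4.0, 0.0],
    [0.0, 0.0, 4.0, 0.0, 0.0, 4.0],
]


def cosine(u, v):
    """Cosine similarity: near 1.0 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def preprocess(text):
    """Lowercase, fix known misspellings, drop stop words and unknown words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [SPELLING_FIXES.get(t, t) for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS and t in WORD_VECTORS]


def phrase_vector(text):
    """Average the word vectors of the known words in a phrase."""
    vectors = [WORD_VECTORS[t] for t in preprocess(text)]
    if not vectors:
        return (0.0, 0.0, 0.0)
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def code_occupation(industry_text, occupation_text):
    """Classify an occupation using BOTH the industry and occupation phrases."""
    features = phrase_vector(industry_text) + phrase_vector(occupation_text)
    scores = [sum(w * x for w, x in zip(row, features)) for row in OCCUPATION_WEIGHTS]
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return OCCUPATION_LABELS[best], probabilities[best]


label, probability = code_occupation("the resturant", "my job is waitres")
print(label, round(probability, 3))
print(cosine(WORD_VECTORS["waiter"], WORD_VECTORS["restaurant"]))
print(cosine(WORD_VECTORS["construction"], WORD_VECTORS["waitress"]))
```

In this toy vocabulary the waiter and restaurant vectors have a much higher cosine similarity than construction and waitress, which mirrors the "closer together" intuition from the discussion. A real system would learn both the vectors and the network weights from coded training data.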
[Kathleen Walch] Yeah, I can imagine that, you know, this is saving so much time, right? And probably maybe reducing errors, kind of keeping things uniform. So there’s so many benefits that happen with this. But, you know, how are public health departments making advances with these transformative technologies?
And maybe specifically, how are they adopting and using NIOCCS in ways that either were predicted or ways you didn’t predict? You know, one thing about putting this out there is that whenever you test it, because you’ve created it, sometimes you’ve just assumed people are going to use it in a certain way. And then it actually gets out there and they use it in a different way or in ways maybe you hadn’t thought of before. So can you share with us, you know, how it was used in ways that you had predicted and maybe in some unexpected ways?
[Jennifer Cornell] Thanks, Kathleen. I can take this one. Since the release of the latest version of NIOCCS, we’ve seen major advancements in the collection and coding of work data. And as Stacey mentioned, NIOCCS coding speed and capacity have increased significantly. I’ll share a couple of examples from our public health department partners. One jurisdiction reported they used the NIOCCS Web API extensively during their COVID-19 response.
They coded their health data for weekly fatalities collected from workplace outbreaks and survey data. The outcome was a quick COVID-19 response. It reduced training time for new staff that were pulled in to help with the response and enabled modernized, high-throughput, high-quality I&O coding. Another jurisdiction reported they used NIOCCS to gather real-time coded I&O data on various communicable disease interview forms, and this made the rapid dissemination of essential workplace measures to local and tribal health departments possible. We anticipated jurisdictions would integrate the API as designed directly into their surveillance systems. The API branched out from COVID to begin collecting health data for other infectious diseases and use cases, so we were very pleased to see that. A few positive outcomes we didn’t predict included cross-checking of coding by our users. We have limitations on how much we can test because we only have access to our internal data, so it’s very, very valuable to have external validation.
Another user reviewed probability numbers, and this task was above and beyond what we expected. And then further, private entities and nonprofits used the API to collect standardized codes for workers’ compensation data. We would love to see NIOCCS used more broadly in other scenarios beyond just public health, and we’re excited to see where it goes. Thank you.
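The transcript does not say how that user reviewed the probability numbers. One plausible approach, sketched here under assumptions, is to accept high-probability codes automatically and route the rest to a human coder. The record layout, the placeholder code values, and the 0.80 threshold are hypothetical and are not part of NIOCCS.

```python
REVIEW_THRESHOLD = 0.80  # hypothetical cutoff; a real project would tune this


def split_for_review(predictions, threshold=REVIEW_THRESHOLD):
    """Separate auto-coded records into accepted and needs-manual-review lists."""
    accepted, needs_review = [], []
    for record in predictions:
        if record["probability"] >= threshold:
            accepted.append(record)
        else:
            needs_review.append(record)
    return accepted, needs_review


# Invented example records; "code" values are placeholders, not real codes.
predictions = [
    {"text": "waitress", "code": "A", "probability": 0.97},
    {"text": "secretary", "code": "B", "probability": 0.55},
    {"text": "teacher", "code": "C", "probability": 0.80},
]
accepted, needs_review = split_for_review(predictions)
print(len(accepted), "accepted;", len(needs_review), "sent to a human coder")
```

A lower threshold saves reviewer time at the cost of more uncorrected errors, so the cutoff is a judgment call for each project.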
[Ron Schmelzer] Yeah, very exciting. I think that’s the interesting thing about all of these models. People see that these systems can automatically do things, especially when it’s around language. And the reason why we get so motivated by that, whether it’s ChatGPT answering some interesting questions, generating some great texts and some responses that are really pretty amazing, is because we speak. This is like the thing that humans do. We communicate, right? And when we see machines doing it, we’re like, that’s a pretty cool trick. Machines can do the thing that humans can do with all the vagueness and the inconsistencies and the misspellings.
And that is actually why, even though this is not a generative task, the things you’re doing, it’s more the analysis task and coding, it’s still very powerful, because when you have a bunch of different people in different parts of the country with different native language-speaking capabilities and they’re all trying to type stuff, the variations of even humans are significant. And machines have just traditionally been so literal. It’s like, well, you typed pizza delivery, and then the other one says, like, food delivery, these are very different things. We’re like, okay, pizza delivery, it’s the same thing. Why are you giving me this grief? It’s because machines are literal. So when we add these models, we start to say, oh, we can handle all this variability. It’s very powerful. And we’ve seen many applications of this. And even a couple years ago, some folks were doing auto-coding in many different applications. So it’s really very neat.
And I’m glad to see it in the public sector. So this brings up the next question, which is, of course, we’re dealing with data. And one of the challenges with AI is it’s basically a data dependent system: good data, good results; bad data, bad results. Sometimes you can’t guarantee any of that. And of course, we’re dealing with all the other aspects of data, bias and ethics and fairness and transparency and privacy and all the sorts of things.
So in your app, look, in your world, and maybe it’s more relevant or less relevant, how do you balance some of those needs for transparency, the need to protect sensitive information, perhaps some of those data quality and bias issues when you’re dealing with these advanced analytics and AI applications?
[Stacey Marovich] Yeah. So it’s definitely a challenge for us because much of the data that we deal with is health data. So there can be a lot of sensitivities around that. And with many of our projects, we also have fairly restrictive data use agreements that constrain what we can do with the data or how long we can keep the data, etc.
So that also poses challenges with our machine learning models and training data, and with being able to persist that knowledge. Although I will say for us, with our program, we tend to, for the most part, just get the industry and occupation data elements without any other PII. So we don’t tend to get other demographics like name, age, race, etc. So from that point, we’re not swimming in PII, but there’s still some sensitivities there because even industry and occupation by itself can sometimes be considered PII either on its own or when combined with other demographic data. For instance, if you have an occupation of mayor and you have city name in the industry, that could be considered PII as well, just the industry and occupation on its own. So when we publish out our data to the public, we ensure that there’s no small cell size data that can be construed as PII. So we tend to aggregate up the data to higher levels rather than just posting the detailed data for that.
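The "aggregate up" idea can be illustrated with a short sketch. It is an assumption-laden example, not NIOSH’s actual disclosure procedure: the minimum cell size of 5, the roll-up label, and the sample records are all invented. Detailed industry-by-occupation cells below the minimum are rolled up to the industry level, and a rolled-up cell that is still too small is suppressed.

```python
from collections import Counter

MIN_CELL_SIZE = 5  # hypothetical threshold; real disclosure rules vary


def publishable_counts(records, min_cell_size=MIN_CELL_SIZE):
    """Count (industry, occupation) cells, rolling small cells up by industry.

    Cells at or above the minimum are published as-is. Smaller cells are
    merged into an "all other occupations" cell for their industry, which
    is published only if the merged count reaches the minimum (else None).
    """
    detailed = Counter(records)
    published = {}
    rolled_up = Counter()
    for (industry, occupation), count in detailed.items():
        if count >= min_cell_size:
            published[(industry, occupation)] = count
        else:
            rolled_up[industry] += count
    for industry, count in rolled_up.items():
        key = (industry, "all other occupations")
        published[key] = count if count >= min_cell_size else None
    return published


# Invented records: one mayor would be identifiable if published in detail.
records = (
    [("city government", "mayor")] * 1
    + [("city government", "clerk")] * 6
    + [("restaurant", "waiter")] * 3
    + [("restaurant", "cook")] * 4
)
for cell, count in publishable_counts(records).items():
    print(cell, "suppressed" if count is None else count)
```

Here the single mayor never appears as its own cell, while the two small restaurant cells combine into one count that is large enough to publish.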
But I’ll say as of right now, NIOCCS is really just a tool that’s used to code data. So it’s not like we’re deciding if someone gets a bank loan or determining the type of medical care they get. It’s not to that level, but down the road, we could see that if NIOCCS was expanded into more and more areas, the stakes could become higher. So for example, NIOSH has been pushing for inclusion of work information in electronic health records. So work is a key social determinant of health that should be collected because I think everybody agrees that work has a major impact on health and vice versa. So it really should be prominent and available in EHRs, but currently work isn’t consistently collected in medical records in a standardized way. So this is something that NIOSH is pushing for, and if that happens, coding of work data might possibly impact clinical decision support or other areas of clinical care in the future.
So it is something that we have thought about. Health equity is another key priority for CDC and NIOSH. So not all workers have the same risk of experiencing a work-related health problem even when they have the same job. So looking at occupational health inequities is something that NIOCCS is used for, to facilitate research in this area and promote outreach and prevention activities to identify and reduce health inequities among workers. So in the interest of diversity, equity, and inclusion, actually one of the next big things we’re working on with NIOCCS is to expand our machine learning approach to be able to recognize and code Spanish data, which will allow Spanish-speaking workers to be more fully integrated into surveys and other surveillance and research studies to better assess their occupational health burden so that it can be documented, recognized, and/or remediated. And also, we try to be transparent about how NIOCCS works, because obviously what’s going on under the hood is pretty sophisticated. We’re currently actually writing a manuscript that is going to talk in depth about our methodologies, how we applied auto coding to industry and occupation data, how the system works, so people have a better understanding of everything that goes into our processes for vetting our data.
We have a team of professional data coders that we have in-house that code and validate data from our coding projects that’s then used to train our machine learning model. So there’s a lot going on, but yeah, we’re just trying to get the word out about NIOCCS and get it as broadly adopted as possible.
[Kathleen Walch] Well, perfect. And we’ll definitely make sure to link to it in the show notes as well, so that all of our listeners can go there. Now, I know you had talked about how eventually you wanna get Spanish in there. What language is this coming in? Is it just coming in English or do you get Spanish coding as well?
[Stacey Marovich] So right now, we do get some amount of Spanish data kind of commingled with the English data, especially from states or jurisdictions that have large Spanish speaking populations. And right now, obviously with machine learning, the system is able to code more of that than we have been in the past with previous versions, but it’s still not really tuned or designed to recognize and code Spanish. So that’s something we’d like to enhance the system to be able to do. And it would also open doors for us to code data from jurisdictions like Puerto Rico where their data is collected natively in Spanish. So yeah, it’s definitely something we’ve been wanting to do for a long time. And we’re looking forward to being able to enhance the system to do that because it’s a very critical part of, like I said, our diversity, equity and inclusion initiatives and health equity.
[Kathleen Walch] Yeah, and that’s great to hear. I know that when we talk to other agencies as well, for example, if they have chatbots, they build it in English, but then they talk about wanting to expand to different languages and Spanish is usually the second language that they will expand to and that there can be some trickiness to that, right? Because you just have to make sure that you have enough data to train it depending on if it’s AI or not. And all of that that’s involved in it, but it’s great to hear that this is something that you are thinking about. And maybe in version five, we’ll have that, who knows. So, I mean, this was such a great podcast. We really enjoyed having you both at our event and this follow-up podcast where we could dig a little bit deeper into some of these questions.
We like to end our podcast with the following question because we get such a diverse response. I think because people are able to bring their own background and knowledge into this, as well as just different perspectives. So, what do you see or hope to see as the future of technology and innovation in the government?
[Jennifer Cornell] Currently at the CDC, there’s a priority on data modernization that emphasizes better interoperability and faster data sharing to permit those actionable insights for decision-making. NIOCCS ties very nicely into this initiative by enabling the ability to code work data in real time at the point of collection, which recently, even a few years ago, seemed like an impossibility, given the complexity of the information.
And as you said, all the ways a person can describe their work information in a free text box in just a couple of words. Another part of the data modernization initiative is to lift some of those constraints that Stacey talked about around infrastructure and available software tools, along with streamlining data use agreements. The future of technology and innovation in government, we believe, hinges on upskilling existing staff and hiring technology personnel. This includes an ongoing investment in data infrastructure and really recognizing the value of data for our mission, service, and the public good. The vision is to create one public health community that can engage robustly with healthcare, communicate meaningfully with the public, improve health equity, and have the means to protect and promote health. We hope to see NIOCCS as a vital part of the vision by enabling meaningful use of work data to identify patterns among groups of workers that can really be used to inform interventions to improve health outcomes. And that was the motivation behind the project. And that is our goal as we continue to enhance the system.
[Kathleen Walch] All right. Thanks so much, Jennifer. And Stacey, we’d love to get your thoughts on that.
[Stacey Marovich] Sure. Yeah. I mean, so I just think that we’d like to get the data coded as close to when it’s collected as possible, kind of like what Jennifer reiterated. And then basically, I think then everyone kind of downstream in the data flow, you know, benefits from that if the data can be coded, because obviously what we’ve seen with industry and occupation data especially is, because it is so complex, it either hasn’t been collected at all, or it’s just been collected as free text, or in electronic health records it’s been kind of buried in clinical notes and in ways where it can’t really be used in meaningful ways.
So, you know, I think we’ve come a long way with NIOCCS and especially with the machine learning technology to be able to make real time coding happen and make that a reality. And, you know, obviously we’re all about, you know, having people’s work factored in, because obviously, you know, the bulk of our waking hours are spent at work. So, you know, it obviously has a key impact on people’s health. And we just really want to, you know, shine a light on that. And we did see that with COVID where, you know, it was emphasized. And, you know, because of having NIOCCS, luckily, you know, we were in a pretty good place with the technology to be able to, you know, code the data. And, you know, so we’re ready.
Hopefully there won’t be another pandemic anytime soon. But, you know, I feel like we’re in a good place technologically to be able to handle, you know, to kind of scale up to code more data and code it faster and in ways that it can be used for actionable insights. And, you know, but obviously, as, you know, Jennifer alluded to, you know, we do have some obstacles that we need to overcome, particularly around resources and personnel, finding and being able to hire, you know, good IT people or being able to upskill the existing staff that we already have.
But those are areas that CDC, you know, is aware of and actively working to kind of mitigate. So, you know, I think the future is bright. You know, I feel really good about where we are with NIOCCS and what lies ahead. And we have, you know, a lot of exciting plans, as we’ve kind of talked about. And so, yeah, it’s just a really exciting time, and I’m just really happy to be part of this project.
[Ron Schmelzer] That sounds great. Well, you know, one of the interesting things we’ve been talking about with some of our members, the GovFuture membership community here, especially our Gov folks, is that there’s more attention being put on some of these issues of data sharing, data harmonization, model sharing. And you’d think that actually that would be a super hot topic now, given that we’re talking about AI and automation and big data and cloud and all sorts of stuff.
But interestingly enough, a lot of the conversation around data sharing in the government actually has quieted down quite a bit since maybe its peak in, say, the early 2010s, you know, maybe 10 or 15 years ago when data.gov was a new thing and open data and open Gov stuff was like the popular movement. And that was mostly really about creating data and making data available and things like APIs and data downloads, which there definitely are more available. But now that we’re starting to build these models and now that we’re starting to do more interagency sharing and even sharing outside of the agency with other parties and partners, there’s desire now to say, hey, let’s address a lot of those concerns here.
Maybe the model that you guys built for NIOCCS can be shared with other agencies or other groups, maybe other groups or people can contribute to it. Maybe there’s an open source ability, maybe there is, maybe there isn’t. We know there’s some issues around data sharing and things like that that may prevent that, but there may be opportunities here. So one of the things I’m going to point out to our listeners, especially if you’re involved in the government innovator community and interested in this conversation around perhaps bringing back some of these ideas or surfacing more of these ideas around data sharing and data governance and data standardization or harmonization or data model sharing and things like that: we’d love to have you join us.
So go to govfuture.com/join. For Gov, we want as many folks as possible to join us. This is not a push for an expensive membership or anything like that. That’s not the point, especially since a lot of our Gov and military folks can join at no cost. So really, we want you to be part of this community. We want to address these very important conversation points that may not be as sexy, I guess, as when there’s other things getting popular attention.
So without digging deeper, we could definitely spend a lot more time here. I really want to thank you both for your interesting insights, your great contributions and for participating and sharing everything with our GovFuture podcast audience and with our listeners.
[Stacey Marovich] Thank you so much. I appreciate the opportunity. Yeah, thanks so much for having us on.
[Jennifer Cornell] It’s been great. Thank you for the opportunity and really excited to continue our partnership and listening to future podcasts as well.
[Kathleen Walch] Awesome, we are too. And yes, thank you so much.