Semi-Automated Nonresponse Detection for Surveys (SANDS)

Section 1: Use Case Identifiers

Use Case ID: HHS-CDC-00048
Agency: HHS
Op Div/Staff Div: CDC
Use Case Topic Area: Mission-Enabling (internal agency support)
Is the AI use case found in the below list of general commercial AI products and services?
None of the above.
What is the intended purpose and expected benefits of the AI?
SANDS is a fine-tuned large language model (LLM) designed to assist with human-in-the-loop procedures for open-ended survey responses. Open-ended survey responses vary in quality and are challenging to use effectively because they are labor-intensive to review and therefore cost-prohibitive to use at scale. SANDS is intended to reduce this burden by filtering out clear and obvious nonresponses, such as "This is so dumb, why am I wasting my time on such a stupid question" or "Because It is true," which provide no value to researchers. The model is available on HuggingFace, is not routinely updated, and can be used like any open-source LLM from a secure Python environment.

Manual curation of open-ended survey responses is time-consuming, often requiring extensive hours to identify themes and review outputs. SANDS significantly reduces this manual burden by providing scores for responses, enabling researchers to quickly compile an initial high-quality dataset for qualitative research. SANDS also flags responses needing further examination, streamlining the review process.
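As a rough illustration of how a researcher might apply the released model from Python, the sketch below loads a text-classification checkpoint with the Hugging Face transformers library and scores a few example responses. The model identifier and label names are assumptions for illustration only; consult the model card on HuggingFace for the actual repository name, labels, and recommended usage.

```python
# Minimal sketch, assuming the SANDS checkpoint is published as a standard
# text-classification model on the Hugging Face Hub. The model ID and label
# names are placeholders; check the model card for the real values.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="NCHS/SANDS",   # hypothetical model ID
    top_k=None,           # return a score for every label, not just the top one
)

responses = [
    "This is so dumb, why am I wasting my time on such a stupid question",
    "I avoided indoor gatherings because my doctor said I was high risk.",
]

for text, scores in zip(responses, classifier(responses)):
    print(text)
    for item in scores:
        print(f"  {item['label']}: {item['score']:.3f}")
```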
Describe the AI system's outputs.
CDC's National Center for Health Statistics (NCHS) has developed and released a model to detect nonresponses in open-text survey responses. The system is a natural language processing (NLP) model fine-tuned on a custom dataset of survey responses; its outputs are per-response scores indicating whether a response is a likely nonresponse. These outputs help improve survey data quality and support question and questionnaire design.
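One possible way to act on those scores, sketched under the assumption that the model yields a nonresponse probability per response: keep clearly substantive responses, flag borderline ones for human review, and drop clear nonresponses. The thresholds below are illustrative choices, not values published for SANDS.

```python
# Illustrative triage of responses by an assumed nonresponse probability.
# Thresholds are placeholders chosen for this example, not published values.

def triage(nonresponse_prob: float) -> str:
    """Map a nonresponse probability to a human-in-the-loop action."""
    if nonresponse_prob >= 0.90:
        return "drop"             # clear nonresponse; exclude from the dataset
    if nonresponse_prob >= 0.50:
        return "flag_for_review"  # ambiguous; route to a human coder
    return "keep"                 # likely substantive; include in analysis

scored = [
    ("Because It is true", 0.97),
    ("Not sure, maybe", 0.62),
    ("I wore a mask at work because my coworkers were unvaccinated.", 0.03),
]

for text, prob in scored:
    print(f"{triage(prob):>15}  {prob:.2f}  {text}")
```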
Stage of Development: Operation and Maintenance
Is the AI use case rights-impacting, safety-impacting, both, or neither?
Neither

Section 2: Use Case Summary

Date Initiated: 06/2021
Date when Acquisition and/or Development began: 09/2021
Date Implemented: 09/2022
Date Retired: N/A
Was the AI system involved in this use case developed (or is it to be developed) under contract(s) or in-house?
Developed in-house.
Provide the Procurement Instrument Identifier(s) (PIID) of the contract(s) used.
N/A
Is this AI use case supporting a High-Impact Service Provider (HISP) public-facing service?
N/A
Does this AI use case disseminate information to the public?
No
How is the agency ensuring compliance with Information Quality Act guidelines, if applicable?
N/A
Does this AI use case involve personally identifiable information (PII) that is maintained by the agency?
Yes
Has the Senior Agency Official for Privacy (SAOP) assessed the privacy risks associated with this AI use case?
The assessment is ongoing.

Section 3: Data and Code

Do you have access to an enterprise data catalog or agency-wide data repository that enables you to identify whether or not the necessary datasets exist and are ready to develop your use case?
No
Describe any agency-owned data used to train, fine-tune, and/or evaluate performance of the model(s) used in this use case.
3,000 labeled open-ended responses to web probes on questions relating to the COVID-19 pandemic, gathered from the Research and Development Survey (RANDS) conducted by the Division of Research and Methodology at the National Center for Health Statistics.
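For context, a dataset of this shape (response text plus a nonresponse label) could be used to fine-tune and evaluate a transformer classifier roughly as sketched below. This is a generic fine-tuning outline under assumed column names, file path, and base checkpoint, not the actual NCHS training pipeline.

```python
# Generic fine-tuning sketch for a binary nonresponse classifier.
# The CSV path, column names ("text", "label"), and base checkpoint are
# assumptions for illustration; this is not the actual NCHS pipeline.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

base_model = "distilbert-base-uncased"  # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Hypothetical file of labeled responses: one text column, one 0/1 label column.
dataset = load_dataset("csv", data_files="labeled_probe_responses.csv")["train"]
dataset = dataset.train_test_split(test_size=0.2, seed=42)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sands_sketch",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)

trainer.train()
print(trainer.evaluate())  # held-out metrics on the 20% evaluation split
```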
Is there available documentation for the model training and evaluation data that demonstrates the degree to which it is appropriate to be used in analysis or for making predictions?
Documentation is widely available
Which, if any, demographic variables does the AI use case explicitly use as model features?
N/A
Does this project include custom-developed code?
Yes
If the code is open-source, provide the link for the publicly available source code.
N/A

Section 4: AI Enablement and Infrastructure

Does this AI use case have an associated Authority to Operate (ATO) for an AI system?
No
System Name: N/A
How long have you waited for the necessary developer tools to implement the AI use case?
Less than 6 months
For this AI use case, is the required IT infrastructure provisioned via a centralized intake form or process inside the agency?
Yes
Do you have a process in place to request access to computing resources for model training and development of the AI involved in this use case?
Yes
Has communication regarding the provisioning of your requested resources been timely?
No
How are existing data science tools, libraries, data products, and internally-developed AI infrastructure being re-used for the current AI use case?
None
Has information regarding the AI use case, including performance metrics and intended use of the model, been made available for review and feedback within the agency?
Documentation has been published