
COVID Data Scraping Scheme

Set up a HIPAA-compliant COVID testing website with data scrape functionality built in.

The plan

1. Create a simple HTML website with the COVID data scraping tool, owned by a public health entity.

2. Anyone with a local computer or device holding a patient/worker/school database logs on securely to the public health COVID data scraper. Once there, they click to allow the website scraper to connect and parse the desired/required COVID reporting data elements.

3. The site must use HTTPS encryption to protect privacy and security.

4. Parsed data is then further standardized (i.e., field length, language) for public health and data analysis purposes.
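To make step 4 concrete, here is a minimal sketch of what "standardizing" a scraped record could look like. The field names, length limits, and result vocabulary below are all invented for illustration, not part of any real reporting specification:

```python
# Sketch of step 4, under assumed rules: trim whitespace, enforce a maximum
# field length, and map local result wording onto one standard term.
# Field names, lengths, and the vocabulary are hypothetical.
FIELD_MAX = {"patient_name": 40, "result": 20}
RESULT_MAP = {
    "pos": "Positive", "positive": "Positive", "detected": "Positive",
    "neg": "Negative", "negative": "Negative", "not detected": "Negative",
}

def standardize(record):
    out = {}
    for field, value in record.items():
        value = value.strip()
        if field == "result":
            value = RESULT_MAP.get(value.lower(), value)  # normalize language
        out[field] = value[: FIELD_MAX.get(field, 100)]   # enforce field length
    return out

print(standardize({"patient_name": "  Jane Doe ", "result": "NOT DETECTED"}))
# {'patient_name': 'Jane Doe', 'result': 'Negative'}
```

The real mappings would of course need clinical review; this only shows the mechanical step.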

To further explain, it's like a computer "virus" that, instead of being blocked, is allowed in as a "data scrape program" to pull out data for public health.

The goals are to:

Make it easy for anyone to log in and share their data.

Eliminate the need to manually enter information.

Avoid data entry errors.

Eliminate the need for incoming scraped data to be standardized in advance. One of the most difficult steps in data transfer is creating matched data fields to ensure the transfer works.

Standardize data elements for analysis by public health. By creating a tool to standardize incoming data, any computer or device, in whatever configuration, can provide usable data. The basic idea of web scraping is that we take existing HTML data, use a scraper to identify the data, and convert it into a useful format.
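That last idea, pulling structured data out of existing HTML, can be sketched with Python's standard-library parser. The page fragment and field names here are hypothetical, not a real EMR layout:

```python
from html.parser import HTMLParser

# Minimal sketch: walk an HTML table and collect the text of each cell,
# so it can be re-emitted in a standard format downstream.
class ResultTableScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

# Hypothetical results-page fragment.
html = "<table><tr><td>Jane Doe</td><td>SARS-CoV-2 PCR</td><td>Negative</td></tr></table>"
scraper = ResultTableScraper()
scraper.feed(html)
record = dict(zip(["patient", "test", "result"], scraper.cells))
print(record)  # {'patient': 'Jane Doe', 'test': 'SARS-CoV-2 PCR', 'result': 'Negative'}
```

Real pages vary wildly, which is where the brittleness discussed later in this thread comes in.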

 

Full disclosure - I am not a programmer. I have been working on smoothing medical data transfer for most of my thirty-year career in a large reference lab, and more recently in smaller molecular diagnostic labs. I've worked on large-scale national database projects, including public health reporting. I'd like to work with a team of programmers interested in this concept if it seems at all feasible.


Alicia Beckett 8 months ago

This is an interesting idea!

Who do you imagine would use the scraper? And can you give an example or two of the kind of website it would scrape?


Marianne Weinell 8 months ago

Hi - actually, it's the other way around. The general public, doctors' offices, and people testing employees at the point of care (POC) will log on to a website managed nationally (HHS?). They will only need to know the site is secure and trusted to "extract," aka scrape, their local computer or device for data. The scraping tool resides on the website.

So, another way to explain it: instead of scraping a website, the website is scraping your private company computer. It's the reverse of the usual concept of scraping. The idea originates from what labs do to back up lab data.

Example:
Urgent care runs COVID test > nurse enters test result in their EMR > nurse logs on securely to the website that has the scrape tool > clicks 'scrape tool' > the scrape program connects to the nurse's EMR > the scrape tool images, mirrors, or extracts data from the patient record in the nurse's EMR > data is reformatted on the website, where public health does data analysis.
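A toy, end-to-end sketch of that flow, with the EMR mocked as a plain dictionary. In reality each step (connect, extract, reformat) is its own system and design problem; every name below is invented:

```python
# Hypothetical local EMR record store (in reality: the nurse's EMR system).
EMR = {
    "patient-001": {"name": "J. Smith", "test": "COVID-19 antigen", "result": "positive"},
}

def scrape(emr, patient_id):
    # The 'scrape tool' pulls the raw record out of the local system.
    return dict(emr[patient_id])

def reformat(raw):
    # Website-side standardization before public health analysis.
    return {"test_name": raw["test"].upper(), "result": raw["result"].capitalize()}

report = reformat(scrape(EMR, "patient-001"))
print(report)  # {'test_name': 'COVID-19 ANTIGEN', 'result': 'Positive'}
```

The hard parts this glosses over are exactly the ones raised below: how the site connects to the EMR, and how it knows which record to take.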


Jo Anna Hernandez 8 months ago

How would this differ from accessing the data via various Health Information Exchanges (HIEs)?


Marianne Weinell 8 months ago

The difference is that the scraper program is agnostic and should work universally. The scraper would not need to be set up in advance. HIEs have to be programmed to talk to each other, or have data fields matched so e-data can move from one system to the other. This matching activity costs money and time if two systems are not already optimized.

A scraper should work on any EMR, or with any file where data can be imaged, mirrored, or scraped.

One of the roadblocks I've seen with HIEs is that transfers are limited if data fields are incompatible or require additional programming to allow electronic transfer.
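For contrast, here is roughly what per-connection "field matching" looks like in code: each source system needs its own hand-built mapping to a canonical schema, which is the setup cost being described. System and field names are invented:

```python
# Each connected system needs its own mapping - this table is the part that
# "costs money and time" to build and maintain per connection.
FIELD_MAPS = {
    "emr_a": {"pt_name": "patient_name", "rslt": "result"},
    "emr_b": {"PatientFullName": "patient_name", "TestOutcome": "result"},
}

def to_canonical(system, record):
    mapping = FIELD_MAPS[system]
    return {mapping[k]: v for k, v in record.items() if k in mapping}

print(to_canonical("emr_b", {"PatientFullName": "Jane Doe", "TestOutcome": "Negative"}))
# {'patient_name': 'Jane Doe', 'result': 'Negative'}
```

A truly "agnostic" scraper would still need something equivalent to this table, just learned or configured at scrape time instead of programmed in advance.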


Abigail Watson 8 months ago

Yeah, this is a great idea. Thank you for sharing it and validating interest.

We have some functionality similar to this in the SANER Relay network, insofar as the software stack it's based on can handle a 2-page layout in desktop mode, and we sometimes iframe external sites in the second page. We use the first page as the encoded measure report, and the second page (iframed) lets the user select the data elements from the page and then map them to the report one item at a time. That would be workable. And it's somewhat similar to the ETL extraction project we did a few years ago.

Yeah, we could commit to building out this functionality. Maybe even put together a screen rendering later this week.


Marianne Weinell 8 months ago

I just looked at your description of the SANER Relay network, it does address the concept of an app capable of doing an automated system query to produce a report that can be shared with county, state, federal systems. In the example you posted, hospital records were queried.

Are you saying that if a user in another setting, say an urgent care or school system, created a 2-page layout in desktop mode, that layout could be queried? I assume the mapping to the report, one item at a time, would only have to be done once to establish a regular data transfer process?


Abigail Watson 8 months ago

Well, I'm saying that the 2-page mode has enough functionality and design in place that we could build the scraper mapping/training components rather easily, and then leverage the rest of the infrastructure.

Screen scrapers are useful and fun technologies, but in my experience they're brittle and rarely 'one time' configurations. Each time the website they're pointed at changes or is reconfigured, boom... the scraper breaks.

But they do have a super important and useful role in automating systems when API interfaces aren't available. The design challenge is a simple, easy-to-use training interface for the scraper that lets one choose the elements of the page and then puts it all into a report.

Do you have an example of a school report that might need to be scraped? Somewhere that doesn’t have an EHR?


Andrea Pitkus 8 months ago

Trying to understand how your approach would address the following (in general): 1. How would AOEs (ask-at-order-entry questions) be collected from the ordering provider/patient/specimen collector? This seems to assume they are stored on the patient's or provider's computer - is that correct? If the provider's, then how will the scraper know which patient/data needs to be scraped, to ensure the correct data stays with the correct patient?

2. How would this integrate into an app/LIS or other information source so the patient can be married to the results of the IVD test device/system (either lab-performed, or patient-performed at home like a pregnancy test)? How would the scraped results and AOEs be accurately LOINC- and SNOMED CT-coded to meet ELR and HHS encoding requirements?

3. All transmitted to public health (ELR)

4. All transmitted to HHS (may be met by 3).


Eric Tsibertzopoulos 7 months ago

How is this web scraper going to work when data might be inside a database, Excel files, etc.? How is it going to handle incremental data loads over weekdays?

Perhaps this website concept could be used for hospitals/clinics to upload schema-compliant data files for a batch-load process (triggered when you upload your files).


Marianne Weinell 7 months ago

Based on Andrea's feedback, it sounds like the "scrape" approach is not ready for further development. She noted the "fragility" of these programs and their propensity to crash - which is true. I'd hoped scrape programs had advanced since I last worked with them in the lab several years ago.

If this were to move forward, what's needed is a system to experiment with:
1. This assumes information from the POC device user would already be entered and stored in an iPhone, a local provider or workplace/school computer, or yes, a hospital/clinic EMR.

2. To know which data needs to be scraped, a key word or phrase such as "SARS-CoV-2" or "COVID" would need to be associated with an individual's name or ID to trigger the grab of that individual's data. The data fields transferred would include any LOINC or SNOMED CT codes associated with the individual. I know most systems don't have this data now, but the program should be built to transfer it if available.

3. This assumes data transmission portals to the appropriate state can be reached using the individual's zip code data. Same for HHS data.

4. You are correct - hospitals/clinics could leverage a universal "scraping app" to upload compliant data on their regular schedule, as long as it meets the 24-hour data reporting requirement.
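Point 2, keyword-triggered selection that carries LOINC/SNOMED CT codes along when present, could be sketched like this. The record layout and values are illustrative (94500-6 is a real SARS-CoV-2 PCR LOINC code, but the rest is invented):

```python
# Select only records whose test name matches a trigger keyword, and carry
# any LOINC/SNOMED CT codes along if the source system has them.
KEYWORDS = ("sars-cov-2", "covid")

def select_covid_records(records):
    hits = []
    for rec in records:
        if any(kw in rec.get("test", "").lower() for kw in KEYWORDS):
            hits.append({
                "patient_id": rec["patient_id"],
                "test": rec["test"],
                "result": rec["result"],
                "loinc": rec.get("loinc"),    # often absent in small systems
                "snomed": rec.get("snomed"),
            })
    return hits

records = [
    {"patient_id": "A1", "test": "SARS-CoV-2 RNA PCR", "result": "Detected", "loinc": "94500-6"},
    {"patient_id": "A2", "test": "Influenza A Ag", "result": "Negative"},
]
print(select_covid_records(records))
```

Note that keyword matching alone can't solve Andrea's matching question; the patient ID still has to be trusted from the source system.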

Thank you for your thoughtful comments. At this point, though, what is needed is someone with access to a system who is willing to experiment. I'm not sure where/how to make that connection in the timeframe we need for this Design-a-thon.



Abigail Watson 7 months ago

Happily, some of us are working with rapid-prototyping tools, and were able to do some experimenting over the weekend.

I took our MeasureReport generator page, which is usually a 2-panel layout, and converted it into a 4-panel layout. I pulled in the IN.gov coronavirus website as a sample page to iframe and scrape.

Now then, here is where things get a little wonky. The April CDC NHSN reporting requirements had measures like ICU bed count, ventilators, etc., which are already summarized data and are nicely modeled by the FHIR MeasureReport resource.

The Cares Act Section 18115, on the other hand, drills down into the specifics of lab orders, such that we're being provided OBR, OBX, and PID segments to map to. These data elements would be more appropriately mapped to the Patient and Observation resources.

I loaded up a sample HHS Cares Act Measure, and MeasureReport, and my initial assessment is that we can try to shoehorn the CARES 18115 reporting requirements into the MeasureReport population codes, but it's really not what the MeasureReport is intended for.

Rather, we need to use the population codes to say 'Field Tests' with a count of 60, 'At Home Tests' with a count of 528, 'Positivity Rate' equal to 15%, and so forth. Meanwhile, the population array in the MeasureReport is the perfect place to put a list of the scraped patients and lab results. Or, rather, MeasureReport becomes the ideal reporting bucket that v2.x messages get accumulated into.
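The shape being described might look something like this in rough form; the codes and counts just echo the examples above and are not an official measure definition:

```python
# Rough sketch of a FHIR MeasureReport using population codes for aggregate
# counts, as described above. Illustrative values only.
measure_report = {
    "resourceType": "MeasureReport",
    "status": "complete",
    "type": "summary",
    "group": [{
        "population": [
            {"code": {"text": "Field Tests"}, "count": 60},
            {"code": {"text": "At Home Tests"}, "count": 528},
        ],
        "measureScore": {"value": 0.15},  # 15% positivity rate
    }],
}
print(measure_report["group"][0]["population"][0]["count"])  # 60
```

The per-patient v2.x detail would then hang off this as the accumulated line-level data, per the "reporting bucket" idea.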

So far, so good. While I normally don't recommend screen scraping because it *is* brittle, for the task at hand, I think it has the potential to be a relevant technology.

The question I have is around 'which website is to be scraped'? I grabbed the Indiana coronavirus webpage, which has aggregated stats, but it really doesn't have the laboratory data, since that's PHI.

So, my questions are along the lines of: a) can we get a better sample website to scrape, b) would this scraper need to operate behind a VPN or firewall (quite possibly), and c) do we have a list of non-FHIR-enabled EHRs that we anticipate being used, and how many of them have web interfaces?

So far, I'm thinking a screen scraper is a workable kludge that might fill in the gaps for a lot of lesser-supported EHR and LIS systems that have web interfaces but might not have v2 interfaces set up. That might be a very common situation for at-home testing vendors, who have a basic patient portal but haven't ever had to actually address interoperability.

