
Present: Arezoo, Aslak, Gro, Endre, Ingrid (minute taker), Rune

Endre presents:

M 43 AIDS

...

Finding the optimal way to de-identify data is hard. There are so many possible combinations that it is hard to do in a limited amount of time. You have to find the balance between keeping information and making the data anonymous. There is no obvious way to cluster diseases, and no obvious way to group the data.

What Endre's program does: it collects all values from each row. The program automatically generates hierarchies.

Cornell: a program that does the same, but you have to make the tables yourself. Very time-consuming. Hard to create the hierarchies.

Sensitive attribute: the one that will not be de-identified. The other variables will be de-identified while the sensitive attribute stays as it is.
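
A minimal sketch of the idea above, assuming a tiny made-up table where sex and age are the variables to de-identify and the diagnosis is the sensitive attribute. The function names, the 10-year age bands and the records are illustrative assumptions; this is not Endre's program.

```python
from collections import Counter

# Hypothetical records: (sex, age, diagnosis). Diagnosis is the sensitive attribute.
records = [
    ("M", 43, "AIDS"),
    ("M", 47, "Flu"),
    ("F", 31, "Asthma"),
    ("F", 38, "Flu"),
]

def generalize_age(age, width=10):
    """Replace an exact age with a band such as '40-49' (one hierarchy level)."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def de_identify(rows):
    """Generalize sex/age columns; leave the sensitive attribute untouched."""
    return [(sex, generalize_age(age), diagnosis) for sex, age, diagnosis in rows]

def min_group_size(rows):
    """Size of the smallest group of rows that share the same sex and age values."""
    groups = Counter((sex, age) for sex, age, _ in rows)
    return min(groups.values())

print(min_group_size(records))               # before generalization
print(min_group_size(de_identify(records)))  # after generalization
```

The smallest group size is the "k" in k-anonymity: the larger it is, the harder it is to single out one person, but the coarser the age bands have to be.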

Project 2: (Randveig)

Patient records.

She suspected that the information in health records was of bad quality. If you search for a person, a lot of information is not filled in, and so on. She wanted to find out how bad the quality was, i.e. check the quality of information in health records. Endre worked with her on de-identifying data (dates and so on). Problems: dates were not formatted in the same way, and one patient could have several IDs.
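
A minimal sketch of how inconsistently formatted dates could be brought into one representation. The list of formats is an assumption for illustration; the formats actually found in the files are not recorded in these notes.

```python
from datetime import datetime

# Assumed formats; the real files may use others.
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y", "%Y%m%d")

def parse_date(text):
    """Try each known format in turn; return a date, or None if none match."""
    cleaned = text.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None

for raw in ["2011-03-05", "05.03.2011", "20110305", "unknown"]:
    print(raw, "->", parse_date(raw))
```

Returning None instead of raising makes it easy to count how many values could not be parsed, which is itself a measure of data quality.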

Ingrid:

Take into account the individual (doctor). Suggestions for ICD-10 codes: red, green and so on.

Rune's notes:

Endre:
"M 34 AIDS" example

Grouping DRG codes: 10-102, 104B-107A,
Gro:
 DRG codes are categorical variables (they are hierarchical).
 -Endre: The computer has no way of knowing this...
 
Have you seen the aggregated NPR cubes?
 ("Cognos"? Create a table at aggregate levels of DRG cubes.)

Gro: De-identified vs. anonymous (not back-trackable)

Rune: We will never get truly anonymous data.
--One easy way out is to avoid "complete" data. If you remove

Aslak:
We predefine which groups we need (age, diagnoses, etc).
--Can we anonymize 10 or 100 variables?
--We can only do 15 columns (anonymized) with the German Flash algorithm.
--If we have 100 columns (we need three columns to be 5-diverse), that's no problem.
--In one day: how many columns?
--10 or 11 columns in half a minute.
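
A minimal sketch of what "5-diverse" means, assuming the simplest ("distinct") definition of l-diversity: every group of rows with the same de-identified values must contain at least l different values of the sensitive column. The function name, column names and rows are hypothetical, and only one sensitive column is checked.

```python
from collections import defaultdict

def is_l_diverse(rows, quasi_identifiers, sensitive, l):
    """True if every group of identical quasi-identifier values
    holds at least l distinct values of the sensitive column."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].add(row[sensitive])
    return all(len(values) >= l for values in groups.values())

# Hypothetical, already generalized rows.
rows = [
    {"sex": "M", "age": "40-49", "diagnosis": "AIDS"},
    {"sex": "M", "age": "40-49", "diagnosis": "Flu"},
    {"sex": "F", "age": "30-39", "diagnosis": "Flu"},
    {"sex": "F", "age": "30-39", "diagnosis": "Flu"},
]

print(is_l_diverse(rows, ["sex", "age"], "diagnosis", 2))
```

Here the second group contains only one diagnosis, so anyone known to be in that group has their diagnosis revealed even though the group has two rows.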

Gro:
Two directions:
--One public anonymous table for NSD
--One private semi-anonymous table for own research.

Endre:
We can throw away 10% of the "outliers".
--Rune: Then we cannot do back-tracking anymore, right?
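
A minimal sketch of throwing away "outliers", assuming an outlier is a row whose combination of de-identified values occurs fewer than k times. The function name, the parameters and the choice to fail when more than 10% would be dropped are illustrative assumptions, not the behaviour of the Flash program.

```python
from collections import Counter

def suppress_outliers(rows, quasi_identifiers, k, max_fraction=0.10):
    """Drop rows in groups smaller than k, but refuse to drop
    more than max_fraction of the table."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    sizes = Counter(key(row) for row in rows)
    kept = [row for row in rows if sizes[key(row)] >= k]
    dropped = len(rows) - len(kept)
    if dropped > max_fraction * len(rows):
        raise ValueError(f"would drop {dropped} of {len(rows)} rows")
    return kept

# Hypothetical table: nine rows share one group, one row is unique.
table = [{"sex": "F", "age": "30-39"} for _ in range(9)]
table.append({"sex": "M", "age": "90-99"})

print(len(suppress_outliers(table, ["sex", "age"], k=2)))
```

The dropped rows are gone from the output, which is why they cannot be traced back to the source records afterwards.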

NEXT STEP
Rune:

What is the question that Rannveig (HEMIT) wants to answer:
--Is the information in NPR of good enough quality?
--Does it match the quality of PAS? of the hospital internal records?
Aslak: What did you do?
--De-identifying data, like dates and so on.
--Representation of time is hard to make uniform between systems. Gro: Why?
---It was not represented in the same way within one file.
---Several patient IDs per patient.
--Gro: What if you stripped away newborns and immigrants that don't have proper IDs?
Aslak: More about what you did?
--Anonymize: Every person had a day1 when s/he entered the hospital. (Normalize dates)
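
A minimal sketch of the "day1" normalization, assuming each event is a (patient ID, date) pair and that day 1 is the patient's earliest date in the data. The function name and the sample events are hypothetical.

```python
from datetime import date

def to_relative_days(events):
    """Replace each calendar date with a day number counted from the
    patient's first date, so that the first visit becomes day 1."""
    first_seen = {}
    for patient_id, when in events:
        if patient_id not in first_seen or when < first_seen[patient_id]:
            first_seen[patient_id] = when
    return [
        (patient_id, (when - first_seen[patient_id]).days + 1)
        for patient_id, when in events
    ]

events = [
    ("a", date(2010, 5, 3)),
    ("a", date(2010, 5, 10)),
    ("b", date(2011, 1, 1)),
]
print(to_relative_days(events))
```

Intervals between a patient's events are preserved, while the calendar dates themselves are no longer present in the output.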

Where can we download the Kanon (Flash) program?
--SVN, update SVN+repository page in Wiki

How can we run Kanon?

Gro:
Why did you have to work with dates to de-identify the data?

Endre:
Rannveig's workflow:
1) Pick data out of the DB with complex queries
2) Run Endre's Python programs on PAS, NPR,
3) Make sure this is re-usable!

THIRD STEP
What did you do for Kjartan?
--He wanted to run Kanon on his data, but the data was too big.
--Don't include too many columns. (as non-sensitive).

RUNE'S WORK FOR PASTAS

Take UNNis data and visualize
-Solid line for 24-hours of day services
-Dashed line with annotation hours per week for services

In the future

Concentrate on Evicare from 14-15, and Pastas from 15-16