12.7 million Americans are now covered by private insurers because of the ACA.
Most had never had insurance coverage before the ACA/Obamacare.
The insurance enrollment process is challenging once you factor in chronic conditions, care-provider relationships, and drug coverage.
If you have employer-provided coverage, think about what open enrollment is like for you every year.
Comparing insurance plans on what really matters is almost impossible.
The federal and state healthcare exchanges find that the data provided by insurers is incomplete; the only way to get the facts is to track them down manually and know the right questions to ask.
Doctors that are supposed to be in-network often aren't, and the specialists commonly required to treat a disease are either not in the plan or too far away to be accessible.
The familiar elements of e-commerce search map directly onto insurance shopping:
Facets: brand, size, cost, and color become insurer, cost, co-pay, and drugs
Recommendations: "People like you chose..."
Pickup Locations: become "Providers or Specialists Near Me"
Ranked Search & Ratings: sponsored searches, and ratings for insurers and providers
Most of the data sources for Semantic Health are the Public Use Files (PUFs) and data dictionaries from the CMS.gov Healthcare Marketplace Data Resources. The main source is the Machine Readable URL PUF (MRURL PUF), which contains a row for each state and insurance provider with the URL of a JSON index file that points to the Plans, Providers, and Formularies. This is the source of the physicians, their addresses, the plans for which they are in-network and accepting patients, and more. The Formularies contain the drugs covered by each plan, and the Plan JSON files contain the plan attributes. The Machine Readable data constitutes a very broad but shallow tree structure, where each branch looks like the graph below.
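As a minimal sketch of that traversal, assuming the MRURL PUF is a CSV and that each issuer's index JSON exposes the plan_urls, provider_urls, and formulary_urls keys from the CMS machine-readable schema (the exact column and key names should be verified against the data dictionary):

```python
import csv
import requests

def iter_issuer_urls(mrurl_puf_path):
    """Yield (state, index_url) pairs from the Machine Readable URL PUF."""
    with open(mrurl_puf_path, newline="") as f:
        for row in csv.DictReader(f):
            # Column names are assumptions; check the PUF data dictionary.
            yield row["State"], row["URL Submitted"]

def fetch_data_urls(index_url):
    """Follow one issuer's index JSON to its per-type data file URLs."""
    index = requests.get(index_url, timeout=30).json()
    return {
        "plans": index.get("plan_urls", []),
        "providers": index.get("provider_urls", []),
        "formularies": index.get("formulary_urls", []),
    }
```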
Semantic Health is a web application that helps people choose the right Obamacare plan. Our users interact with our product through this interface only. We want this application to have a number of features that help them accomplish their goals. Here is a list of requirements for the webpage:
The above schematic shows all the components of our web application (see the code for all components). All of the requirements were satisfied by this setup. Because the web server is based on Flask (Python), it is very easy to get a production server up and running. The web server manages all the backend functionality: it takes in the user's input, validates it, and returns query results to the frontend. It also has endpoints to log visitor and clickstream data to an AWS RDS PostgreSQL database. The production server is deployed on AWS Elastic Beanstalk, which takes care of Apache web server administration and provides load balancing and instance autoscaling.
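As an illustration, a minimal sketch of such a logging endpoint; the table and column names (clickstream, session_id, event, payload, ts) and the connection string are hypothetical:

```python
import json
from datetime import datetime, timezone

import psycopg2
from flask import Flask, request, jsonify

app = Flask(__name__)
# Hypothetical RDS connection string.
conn = psycopg2.connect("postgresql://user:pass@rds-host:5432/semantichealth")

@app.route("/log/click", methods=["POST"])
def log_click():
    event = request.get_json(force=True)
    # "with conn" commits the transaction on success, rolls back on error.
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO clickstream (session_id, event, payload, ts) "
            "VALUES (%s, %s, %s, %s)",
            (event.get("session_id"), event.get("event"),
             json.dumps(event), datetime.now(timezone.utc)),
        )
    return jsonify(status="ok")
```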
The frontend utilizes many elements from searchkit, an Elasticsearch UI framework based on React.js. We chose searchkit because it gave us a substantial head start in developing the UI; it is easy to use and extremely customizable. We were able to quickly build an interactive application that supports both faceted search and text input search. A custom rescore query component re-ranks results based on the LETOR algorithm, and various event handlers assist in recording clickstream data.
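For illustration, a rescore clause of the kind the custom component could emit might look like the following; the letor_score field name is an assumption (it just needs to match whatever the Plan Rank Updater writes):

```python
# Elasticsearch query body with a rescore clause: the base match query
# retrieves candidates, then the top hits are re-scored by a stored
# per-plan LETOR score blended with the original relevance score.
rescore_query = {
    "query": {"match": {"drugs": "metformin"}},
    "rescore": {
        "window_size": 50,  # re-rank only the top 50 hits
        "query": {
            "rescore_query": {
                "function_score": {
                    "field_value_factor": {"field": "letor_score", "missing": 0}
                }
            },
            "query_weight": 0.7,
            "rescore_query_weight": 1.3,
        },
    },
}
```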
Learning to Rank (LETOR) characterizes the search context for each user query, records click-through rate (CTR) data, and builds a ranking model that captures the relationship between user preferences and plan attributes, given a particular health status and practical considerations. Plan attributes, including policy category and drug and provider coverage, are extracted into a feature vector. Pairwise user preferences are harvested from browsing history based on relative importance. Plans returned by Elasticsearch are adjusted at runtime using rankings learned by LETOR from similar contexts.
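A minimal sketch of the pairwise step, assuming the standard click heuristic that a clicked plan is preferred over skipped plans ranked above it, and a simple linear learner on feature differences (the real feature extraction and model may differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_examples(results, clicked_ids, features):
    """results: plan ids in ranked order; features: plan_id -> np.array."""
    X, y = [], []
    for i, pid in enumerate(results):
        if pid not in clicked_ids:
            continue
        for skipped in results[:i]:        # shown above the click, not clicked
            if skipped in clicked_ids:
                continue
            diff = features[pid] - features[skipped]
            X.append(diff);  y.append(1)   # clicked plan preferred
            X.append(-diff); y.append(0)   # mirrored negative example
    return np.array(X), np.array(y)

# Usage sketch:
#   X, y = pairwise_examples(results, clicked_ids, features)
#   ranker = LogisticRegression().fit(X, y)
#   ranker.coef_[0] is a weight vector w; a plan's LETOR score is w . x
```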
The data that drives the SemanticHealth.net site is served up by an Elasticsearch cluster. The raw data from the different data sources is stored and processed before being placed into an index. Healthcare plans are shown based on the adjusted rankings computed in the cluster.
The Machine Readable Data Processor traverses the links of the Machine Readable PUF to download the Provider, Plan, and Formulary JSON files by state, insurer, and plan. These files are streamed to local files and then transferred to S3, which makes it easier to re-ingest the data as needed. The Machine Readable URL PUF contains approximately 36K URLs that must be followed to get all the data for all the states that participate in the national ACA exchange at Healthcare.gov. The JSON files themselves range in size from a few hundred kilobytes to several gigabytes each.
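A sketch of that streaming transfer, assuming boto3 and a hypothetical bucket name and key layout; streaming in chunks keeps memory use flat even for the multi-gigabyte formulary files:

```python
import boto3
import requests

def download_and_archive(url, state, issuer, kind):
    """Stream one JSON file to local disk, then copy it to S3."""
    local_path = f"/tmp/{state}_{issuer}_{kind}.json"
    with requests.get(url, stream=True, timeout=60) as r:
        r.raise_for_status()
        with open(local_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):  # 1 MB chunks
                f.write(chunk)
    # Bucket name and key scheme are hypothetical.
    boto3.client("s3").upload_file(
        local_path, "semantichealth-raw", f"{kind}/{state}/{issuer}.json"
    )
```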
The Extended Data Processor processes a number of other data sources from CMS.gov to augment the core Plan, Provider, and Formulary data with rate information, detailed plan attributes, and co-pay and cost schedules. Logo URLs and drug information are scraped from the web and, along with the intermediate data sources, are stored in a PostgreSQL database for ease of filtering and selection.
The JSON Document Processor uses tracking tables in a PostgreSQL database to record the state of downloads and processing, making it simpler to restart the process or pick up where it left off if an error occurs. Ultimately the JSON files end up as MongoDB documents, which allow for easy querying and augmentation with geo-location data.
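A sketch of the restart logic, with a hypothetical download_status tracking table and a MongoDB collection per file type:

```python
import psycopg2
from pymongo import MongoClient

pg = psycopg2.connect("postgresql://user:pass@localhost:5432/tracking")
mongo = MongoClient()["semantichealth"]

def pending_urls():
    """URLs not yet processed, so a crashed run can pick up where it left off."""
    with pg.cursor() as cur:
        cur.execute("SELECT url FROM download_status WHERE status = 'pending'")
        return [row[0] for row in cur.fetchall()]

def mark_done(url, documents, kind):
    """Store parsed JSON docs in MongoDB, then flag the URL as processed."""
    mongo[kind].insert_many(documents)   # e.g. kind = "providers"
    with pg, pg.cursor() as cur:
        cur.execute("UPDATE download_status SET status = 'done' WHERE url = %s",
                    (url,))
```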
The Plan Indexer assembles a document for each plan from queries against the RDBMS and NoSQL databases, then uses the bulk indexing API to populate the index. The documents conform to a specific mapping to get the best performance from the Elasticsearch queries.
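A minimal sketch of the bulk load using the Python client's bulk helper; the index name and document id field are assumptions:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def index_plans(plan_documents):
    """Bulk-index assembled plan documents into the (pre-mapped) plans index."""
    actions = (
        {"_index": "plans", "_id": doc["plan_id"], "_source": doc}
        for doc in plan_documents
    )
    helpers.bulk(es, actions, chunk_size=500)
```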
The Plan Rank Updater process runs every 3 hours to update the plan documents in the Elasticsearch index with the latest LETOR ranking data.
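The refresh can be done with partial updates so only the ranking field is rewritten rather than the whole plan document; a sketch, reusing the hypothetical letor_score field from the rescore example:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def update_ranks(latest_scores):
    """latest_scores: mapping of plan_id -> freshly computed LETOR score."""
    actions = (
        {"_op_type": "update", "_index": "plans", "_id": plan_id,
         "doc": {"letor_score": score}}
        for plan_id, score in latest_scores.items()
    )
    helpers.bulk(es, actions)
```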
Besides the main data source for the SemanticHealth project, the CMS.gov Healthcare Marketplace Data Sets, we collected additional external data sets to further enhance search functionality and thereby improve the overall user experience. The three external data sets we focused on were:
Drugs and associated diseases/conditions: this allows users to search for plans by the specific disease or condition they have. It also powers an auto-complete feature when users enter a disease/condition before searching for plans (see the sketch after this list).
Geo-coding of over a million provider addresses: this allows users to look up providers that are nearby, based on the plan they selected.
Provider ratings (Yelp, HealthGrades, Vitals, UCompareHealthCare): this data would enhance learning and ranking based on existing provider ratings. (Note: we were not able to fully collect provider ratings due to technical limitations we ran into - discussed below.)
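As a sketch of how the disease/condition auto-complete mentioned above could be served, assuming conditions are indexed with an Elasticsearch completion-typed suggest field (index and field names hypothetical):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

def suggest_conditions(prefix):
    """Return up to 10 condition names completing the user's prefix."""
    resp = es.search(index="conditions", body={
        "suggest": {
            "condition-suggest": {
                "prefix": prefix,
                "completion": {"field": "suggest", "size": 10},
            }
        }
    })
    options = resp["suggest"]["condition-suggest"][0]["options"]
    return [opt["text"] for opt in options]

# e.g. suggest_conditions("diab") might return diabetes-related conditions.
```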
Drugs and their associated conditions were scraped from the following sources:
The data from Drugs.com was more comprehensive and included additional information such as each drug's impact on pregnancy, its potential for abuse as a controlled substance, and its interactions with alcohol. These additional data elements can eventually be used to enhance plan search and provide additional features for LETOR or other learning algorithms.
Pulling geocodes for over a million providers proved to be a significant challenge at first. Most data sources and APIs impose daily limits on free geo lookups, and additional lookups were prohibitively expensive, at least for the purposes of this project. In the end we were very excited that SmartyStreets (thank you, Jefferson!) offered to support our project with an unlimited-pull license, which we used to pull geo locations for over a million providers.
We use three different APIs, in a waterfall fashion, to pull geo locations from the three sources listed below. Addresses that come back empty from all three are saved separately to retry at a later time.
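The waterfall itself is straightforward; a sketch, with the three geocoder clients abstracted as callables:

```python
def geocode_waterfall(addresses, geocoders):
    """Try each geocoder in order for every address.

    geocoders: ordered list of callables, address -> (lat, lon) or None,
    standing in for the three API clients (SmartyStreets plus fallbacks).
    """
    located, missed = {}, []
    for addr in addresses:
        for geocode in geocoders:
            result = geocode(addr)
            if result is not None:
                located[addr] = result
                break
        else:
            missed.append(addr)  # saved separately, re-tried later
    return located, missed
```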
Provider ratings were scraped using a project forked from PCPInvestigator, written by Tory Hoke.
The PCPInvestigator supports scraping from four different healthcare provider rating sites: Yelp, HealthGrades, Vitals, and UCompareHealthCare.
Although the code is functional and tested, it has one major limitation when run on a large set of providers: it uses Google to do a search, scrapes the search results for any mention of one of the rating sites, and then crawls the specific rating websites. Even with randomized request headers and timing between requests, the searches returned 403 errors, mostly from Google blocking the IP. So we were not able to fully utilize provider ratings for the project at this time.
All infrastructure was built and managed with Vagrant and Ansible.
Amazon AWS services used:
IBM SoftLayer services used:
There are many possible next steps, both commercial and functional.
Sponsored Search: just as plan rankings can be adjusted to surface the best plan for the user, they can be adjusted to coordinate with sponsors. This is a significant potential revenue stream: health insurance advertising expenditure is nearly $20 billion, and even a small percentage of that budget is substantial. Given the targeting capabilities of the Semantic Health platform, insurance advertisers would get significantly more bang for their buck.
Search Data and Purchase Data: Semantic Health has data that's not available to anyone else - what people actually look for and consider when searching or shopping for a health insurance plan.
Chatbots: many common questions can be answered with chatbot technology, honing the personalized experience of Semantic Health. Chatbots can significantly reduce the need for human agents to step in and take over the more complex activities that may be required to finalize an insurance plan purchase.
Agents and Agency: by becoming a licensed insurance seller in the states that require it, Semantic Health extends its data beyond browsing to purchasing, increasing both the value of the data and the measurable results of advertising dollars. Human agents can assist with the more complex issues.
Free-Form Query: combined with voice input, free-form text queries provide a level of interaction unknown in the insurance industry today, all without a human agent. The ability to pull the key pieces of information out of a statement like "I'm a housewife from central Minnesota with 2 children, ages 4 and 6, and a husband with Type 2 diabetes" makes the entire experience much more user friendly. By leveraging the various voice APIs available for speech-to-text, combined with NLP to extract the meaning of the sentence, we can build on much of what already exists today.
Mobile: this seems like a no-brainer given the ubiquity of mobile devices relative to computers. Combined with free-form voice queries, mobile could be an amazingly effective way to shop for insurance.
The SemanticHealth Project would not have been possible without the help and support of many people. We express our sincere thanks and deepest gratitude to all those that have helped us along the way.
David Steier & Coco Krumme: thank you for an awesome learning experience during the Capstone class, and for your invaluable feedback, support, and guidance during every stage of the project.
Jimi Shanahan: thank you, Jimi, for the motivation and unwavering support - you showed us how much farther we are able to go, and that's just the first mountain!
Safyre Anderson, Chris Dailey, Marjorie Sayer: members of the original BayesHack 2016 team who helped lay much of the groundwork for this project.
David Portnoy: thank you for showing us one way, out of many, that we can contribute to making things better for everyone, and for your support and encouragement during the BayesHack 2016 HHS prompt.
Naama O. Pozniak (Owner @ A+ Insurance Service): thank you so much for the invaluable insight you provided into the real issues people are facing today with ACA/Obamacare. This has helped us tremendously.
Jefferson from SmartyStreets.com: thank you so much for allowing us to use SmartyStreets to geocode over a million provider addresses. We could not have done it without your help!
We are students in UC Berkeley's Master of Information and Data Science program. Signing up for health insurance under the Affordable Care Act (Obamacare) is difficult, and finding plans that include your preferred providers or the drugs you depend on involves significant work and expertise. We built Semantic Health with the goal of substantially improving the Obamacare enrollment process.