Plural Policy Legislator Data Processing#
Overview#
This notebook processes legislator biographical data from the Plural Policy (OpenStates) repository to create comprehensive legislator datasets with bioguide_id mappings. This data serves as a critical reference for matching legislators across different data sources in the Bridge Grades methodology.
The notebook generates two key datasets:
- `plural_legislators_with_bioguide.csv`: Current legislators with bioguide_id mappings
- `plural_legislators_retired_with_bioguide.csv`: Historical legislators with bioguide_id mappings
These datasets enable accurate legislator identification when processing bill sponsorship data from Plural Policy sources.
Data Sources#
Input Files#
OpenStates People Repository - YAML files containing legislator biographical data
Repository: openstates/people.git
Data Format: Individual YAML files per legislator
Data Source Details#
Source: OpenStates People Repository
Congress: Historical and current legislators
Collection Date: August 8, 2025
Coverage: Comprehensive legislator database with multiple identifier schemes
Outputs#
Current Legislators Dataset#
File: plural_legislators_with_bioguide.csv
Columns:
- `id`: OpenStates unique identifier
- `name`: Full name of the legislator
- `given_name`: First name
- `family_name`: Last name
- `birth_date`: Date of birth
- `gender`: Gender information
- `email`: Contact email
- `image`: Profile image URL
- `party`: Political party affiliation
- `role_type`: upper (Senate) or lower (House)
- `district`: Congressional district
- `role_start_date`: Term start date
- `role_end_date`: Term end date
- `bioguide_id`: Congressional bioguide identifier (critical for Bridge Grades)
- `social_*`: Social media identifiers
- `*_id`: Various other identifier schemes
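Once exported, the file can be consumed like any CSV. The two-row inline CSV below is a stand-in so the snippet runs without the real file (the actual dataset has 32 columns); the column names match the list above, but the records are invented:

```python
import io
import pandas as pd

# Stand-in for plural_legislators_with_bioguide.csv (real file has 32 columns).
csv_text = (
    "id,name,party,role_type,district,bioguide_id\n"
    "ocd-person/aaa,Jane Doe,Independent,lower,1,D000000\n"
    "ocd-person/bbb,John Roe,Independent,upper,,R000001\n"
)
legislators = pd.read_csv(io.StringIO(csv_text))

# bioguide_id is the join key into the other Bridge Grades sources.
lookup = dict(zip(legislators["bioguide_id"], legislators["name"]))
print(lookup["D000000"])  # Jane Doe
```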
Retired Legislators Dataset#
File: plural_legislators_retired_with_bioguide.csv
Description: Historical legislators with the same structure as the current dataset, enabling matching of historical bill sponsorship data.
Technical Requirements#
Dependencies#
- `pandas`: Data manipulation and analysis
- `yaml`: YAML file parsing (PyYAML)
- `os`: File system operations
- `collections.defaultdict`: Rows whose missing fields default to None
Data Processing Notes#
YAML Parsing: Handles complex nested YAML structures
Identifier Extraction: Processes multiple identifier schemes per legislator
Data Normalization: Standardizes format across all legislator records
Missing Value Handling: Graceful handling of incomplete records
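These notes can be made concrete with a minimal normalization sketch for one record. The dict below stands in for the output of `yaml.safe_load` on a single legislator file; the field names mirror the OpenStates layout, but the record itself is invented for illustration:

```python
from collections import defaultdict

# Hypothetical record, standing in for yaml.safe_load() on one legislator file.
parsed = {
    "id": "ocd-person/00000000-0000-0000-0000-000000000000",
    "name": "Jane Doe",
    "party": [{"name": "Independent"}],
    "roles": [{"type": "lower", "district": "1", "start_date": "2023-01-03"}],
    "other_identifiers": [{"scheme": "bioguide", "identifier": "D000000"}],
}

def normalize(data):
    """Flatten one nested record into a flat row; absent fields stay None."""
    row = defaultdict(lambda: None)
    row["id"] = data.get("id")
    row["name"] = data.get("name")
    # Guard against empty lists as well as missing keys.
    row["party"] = (data.get("party") or [{}])[0].get("name")
    role = (data.get("roles") or [{}])[-1]
    row["role_type"] = role.get("type")
    row["district"] = role.get("district")
    row["role_start_date"] = role.get("start_date")
    row["role_end_date"] = role.get("end_date")
    for id_obj in data.get("other_identifiers", []):
        row[f"{id_obj.get('scheme')}_id"] = id_obj.get("identifier")
    return row

row = normalize(parsed)
print(row["bioguide_id"])    # D000000
print(row["role_end_date"])  # None -- missing value handled gracefully
```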
Data Processing Pipeline#
Step 1: Repository Data Collection#
Accesses OpenStates people repository
Identifies all YAML files containing legislator data
Processes each file individually
Step 2: YAML Processing#
Parses complex nested YAML structures
Extracts biographical and role information
Handles multiple identifier schemes per legislator
Step 3: Data Normalization#
Standardizes field names and formats
Handles missing values appropriately
Creates consistent data structure
Step 4: Output Generation#
Separates current and retired legislators
Exports clean datasets to CSV format
Validates bioguide_id completeness
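One way to realize Step 4's current/retired split is sketched below. Treating a missing role_end_date as "currently serving" is an illustrative assumption, not necessarily the notebook's exact criterion:

```python
import pandas as pd

# Toy rows; a missing role_end_date marks a sitting legislator (assumption).
df = pd.DataFrame({
    "name": ["A. Sitting", "B. Former"],
    "bioguide_id": ["S000001", "F000002"],
    "role_end_date": [None, "2019-01-03"],
})

current = df[df["role_end_date"].isna()]
retired = df[df["role_end_date"].notna()]

# Validate bioguide_id completeness before handing the frames downstream.
assert current["bioguide_id"].notna().all()
assert retired["bioguide_id"].notna().all()

# Each frame is then exported, e.g.
# current.to_csv("plural_legislators_with_bioguide.csv", index=False)
# retired.to_csv("plural_legislators_retired_with_bioguide.csv", index=False)
print(len(current), len(retired))  # 1 1
```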
Usage in Bridge Grades Pipeline#
This dataset serves as the legislator reference for bill sponsorship processing:
Source A-B Processing: Enables bioguide_id matching for bill sponsorship data from Plural Policy
Data Quality Assurance: Provides comprehensive legislator identification for validation
Historical Analysis: Supports analysis of historical collaboration patterns
Cross-Reference Validation: Ensures data consistency across different sources
Critical Role: Essential for accurate legislator identification when processing bill sponsorship data, as it provides the bioguide_id mappings required to link Plural Policy data with other Bridge Grades sources.
Notebook Walkthrough: Plural Policy Legislator Data Processing#
This notebook demonstrates the process of extracting and standardizing legislator data from the OpenStates repository to create comprehensive legislator datasets with bioguide_id mappings.
Key Steps:
Repository Access: Load and parse YAML files from OpenStates repository
Data Extraction: Extract biographical and role information from nested YAML structures
Identifier Processing: Handle multiple identifier schemes including bioguide_id
Data Standardization: Create consistent format across all legislator records
Output Generation: Export current and retired legislator datasets
Expected Runtime: 1-2 minutes
# Import required libraries
import pandas as pd
import os
# !pip install pyyaml  # uncomment if PyYAML is not installed
import yaml
from collections import defaultdict
Repository Access and File Discovery#
This section accesses the OpenStates people repository and identifies all YAML files containing legislator data. Each YAML file contains comprehensive biographical information for a single legislator.
Repository Structure#
Source: OpenStates people repository (openstates/people.git)
Format: Individual YAML files per legislator
Location: `data/us/legislature` directory
Coverage: Current and historical legislators
Note
Repository Setup: Ensure you have cloned the OpenStates people repository locally before running this notebook. The repository contains thousands of YAML files with legislator data.
# Clone the repository first: https://github.com/openstates/people.git
folder_path = "data/us/legislature"
files = [f for f in os.listdir(folder_path) if f.endswith(".yml")]

# Report how many legislator files were found
print(f"Found {len(files)} YAML files")
YAML Processing and Data Extraction#
This section processes the YAML files to extract legislator information and create a comprehensive dataset. The process handles complex nested YAML structures and multiple identifier schemes.
Processing Strategy#
Two-Pass Processing: First pass identifies all identifier schemes, second pass extracts data
Nested Structure Handling: Processes complex YAML hierarchies
Identifier Extraction: Handles multiple identifier schemes per legislator
Data Normalization: Creates consistent structure across all records
Key Data Fields#
Basic Information: Name, birth date, gender, contact information
Role Information: Chamber, district, term dates, party affiliation
Identifiers: bioguide_id, social media IDs, other identifier schemes
Metadata: Image URLs, email addresses, role history
Warning
Memory Considerations: Processing thousands of YAML files can be memory-intensive. The two-pass approach helps manage memory usage by first identifying all possible identifier schemes.
# Convert all the YAML files into a dataframe
rows = []

# First pass: gather all possible identifier schemes
all_schemes = set()
for file in files:
    with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
        data = yaml.safe_load(f)
    for id_obj in data.get("other_identifiers", []):
        all_schemes.add(id_obj.get("scheme"))

# Second pass: process each file with all identifier columns
for file in files:
    with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
        data = yaml.safe_load(f)

    # Prepare a row whose missing fields default to None
    row = defaultdict(lambda: None)

    # Basic biographical fields
    row["id"] = data.get("id")
    row["name"] = data.get("name")
    row["given_name"] = data.get("given_name")
    row["family_name"] = data.get("family_name")
    row["birth_date"] = data.get("birth_date")
    row["gender"] = data.get("gender")
    row["email"] = data.get("email")
    row["image"] = data.get("image")
    # Guard against an empty party list as well as a missing key
    row["party"] = (data.get("party") or [{}])[0].get("name")

    # Most recent role: the last entry in the roles list
    role = (data.get("roles") or [{}])[-1]
    row["role_type"] = role.get("type")
    row["district"] = role.get("district")
    row["role_start_date"] = role.get("start_date")
    row["role_end_date"] = role.get("end_date")

    # Social media identifiers (if available)
    for k, v in data.get("ids", {}).items():
        row[f"social_{k}"] = v

    # Other identifier schemes (bioguide, fec, govtrack, ...)
    for id_obj in data.get("other_identifiers", []):
        scheme = id_obj.get("scheme")
        identifier = id_obj.get("identifier")
        row[f"{scheme}_id"] = identifier

    rows.append(row)

# Create the DataFrame
df = pd.DataFrame(rows)

# Fill in missing columns for any scheme not found in every legislator
for scheme in all_schemes:
    col_name = f"{scheme}_id"
    if col_name not in df.columns:
        df[col_name] = None

df.columns
Index(['id', 'name', 'given_name', 'family_name', 'birth_date', 'gender',
'email', 'image', 'party', 'role_type', 'district', 'role_start_date',
'role_end_date', 'social_twitter', 'social_facebook', 'ballotpedia_id',
'bioguide_id', 'fec_id', 'google_entity_id_id', 'govtrack_id',
'house_history_id', 'icpsr_id', 'opensecrets_id', 'pictorial_id',
'thomas_id', 'votesmart_id', 'wikidata_id', 'wikipedia_id',
'social_youtube', 'cspan_id', 'maplight_id', 'lis_id'],
dtype='object')
Data Validation and Quality Assurance#
This section validates the processed data to ensure completeness and quality. We check for missing bioguide_id values and verify data integrity.
Validation Steps#
Missing Value Check: Identify records without bioguide_id
Data Completeness: Verify all expected fields are present
Identifier Validation: Ensure bioguide_id format consistency
Record Count Verification: Confirm expected number of legislators
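The checks above can be expressed in a few lines of pandas. The three-row frame is a toy stand-in for the processed table, and the bioguide_id format (one capital letter followed by six digits) is inferred from the sample IDs shown in this notebook:

```python
import pandas as pd

# Toy stand-in for the processed legislator table.
df = pd.DataFrame({
    "name": ["Kweisi Mfume", "Eric Schmitt", "Unknown Person"],
    "bioguide_id": ["M000687", "S001227", None],
})

# 1. Missing-value check: these rows cannot be linked to sponsorship data.
missing = df[df["bioguide_id"].isna()]

# 2. Format check: one capital letter followed by six digits.
known = df["bioguide_id"].dropna()
bad_format = known[~known.str.match(r"^[A-Z]\d{6}$")]

# 3. Record-count check against the expected number of legislators.
print(f"{len(missing)} missing, {len(bad_format)} malformed, {len(df)} total")
# 1 missing, 0 malformed, 3 total
```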
Note
bioguide_id Importance: The bioguide_id is the critical identifier that links this dataset with bill sponsorship data from Plural Policy. Records without bioguide_id cannot be used in the Bridge Grades pipeline.
df.head()
| | id | name | given_name | family_name | birth_date | gender | email | image | party | role_type | ... | opensecrets_id | pictorial_id | thomas_id | votesmart_id | wikidata_id | wikipedia_id | social_youtube | cspan_id | maplight_id | lis_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ocd-person/79575558-ef44-5bb5-9c64-3d3fe3fb4427 | Kweisi Mfume | Kweisi | Mfume | 1948-10-24 | Male | https://mfume.house.gov/address_authentication... | https://unitedstates.github.io/images/congress... | Democratic | lower | ... | N00001799 | 13090 | 00798 | 26892 | Q519504 | NaN | NaN | NaN | NaN | NaN |
| 1 | ocd-person/9db37a87-2ba9-56a0-9b42-89697222e044 | Carlos Giménez | Carlos | Giménez | 1954-01-17 | Male | https://gimenez.house.gov/contact | https://unitedstates.github.io/images/congress... | Republican | lower | ... | N00046394 | 13009 | NaN | 81366 | Q5041653 | Carlos A. Giménez | NaN | NaN | NaN | NaN |
| 2 | ocd-person/84a22f15-cf83-5f0b-a048-a6fc50aa60fe | Chris Smith | Chris | Smith | 1953-03-04 | Male | https://chrissmith.house.gov/contact/zipauth.htm | https://unitedstates.github.io/images/congress... | Republican | lower | ... | N00009816 | 13153 | 01071 | 26952 | Q981167 | Chris Smith (New Jersey politician) | UCtCNUDo3-I1gsd_03ppDfZg | 6411 | 469 | NaN |
| 3 | ocd-person/bac6c65d-846b-5e60-9532-fa216c99ccf6 | Mike Turner | Mike | Turner | 1960-01-11 | Male | https://turner.house.gov/email-me | https://unitedstates.github.io/images/congress... | Republican | lower | ... | N00025175 | 13204 | 01741 | 45519 | Q505722 | Mike Turner | UC-6Ss-aZ3OPisf9GdVGtN8g | 1003607 | 496 | NaN |
| 4 | ocd-person/999df51b-9318-55b9-b0f6-d738ffc1d62d | Eric Schmitt | Eric | Schmitt | 1975-06-20 | Male | https://www.schmitt.senate.gov/contact/ | https://unitedstates.github.io/images/congress... | Republican | upper | ... | N00048414 | 13416 | NaN | 104474 | Q5387455 | NaN | NaN | NaN | NaN | S420 |
5 rows × 32 columns
Dataset Export and Finalization#
This section exports the processed legislator data to CSV format for use in subsequent Bridge Grades processing steps.
Export Process#
Data Finalization: Ensure all data is properly formatted
CSV Export: Save dataset to CSV format
File Naming: Use descriptive filename for easy identification
Quality Check: Verify export completeness
Output Files#
- `plural_legislators_with_bioguide.csv`: Current legislators with bioguide_id mappings
- `plural_legislators_retired_with_bioguide.csv`: Historical legislators with bioguide_id mappings
Next Steps
These datasets will be used in the Source A-B processing notebook to enable accurate legislator identification when processing bill sponsorship data from Plural Policy.
df[["id", "name", "bioguide_id"]]
| | id | name | bioguide_id |
|---|---|---|---|
0 | ocd-person/79575558-ef44-5bb5-9c64-3d3fe3fb4427 | Kweisi Mfume | M000687 |
1 | ocd-person/9db37a87-2ba9-56a0-9b42-89697222e044 | Carlos Giménez | G000593 |
2 | ocd-person/84a22f15-cf83-5f0b-a048-a6fc50aa60fe | Chris Smith | S000522 |
3 | ocd-person/bac6c65d-846b-5e60-9532-fa216c99ccf6 | Mike Turner | T000463 |
4 | ocd-person/999df51b-9318-55b9-b0f6-d738ffc1d62d | Eric Schmitt | S001227 |
... | ... | ... | ... |
533 | ocd-person/86b65fd7-4549-51fa-80e8-eaf3daf3e60e | Lou Correa | C001110 |
534 | ocd-person/10de7024-4b40-57b0-ae78-101372fd4a02 | Pat Ryan | R000579 |
535 | ocd-person/39f36070-b860-5345-a3cc-ea7fdbf7dfb3 | Adrian Smith | S001172 |
536 | ocd-person/74d8d6c8-c349-4fe7-ae18-62c69c4f8d4b | Craig Goldman | G000601 |
537 | ocd-person/bb0d60f6-19a4-5229-a67d-a1d428e7c0b2 | Ben Cline | C001118 |
538 rows × 3 columns
# are there any missing bioguide ids?
df[df['bioguide_id'].isna()].head()
| id | name | given_name | family_name | birth_date | gender | email | image | party | role_type | ... | opensecrets_id | pictorial_id | thomas_id | votesmart_id | wikidata_id | wikipedia_id | social_youtube | cspan_id | maplight_id | lis_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 32 columns
# Save file to csv
df.to_csv("plural_legislators_with_bioguide.csv", index=False)
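The cell above writes only the current-legislator file. Producing plural_legislators_retired_with_bioguide.csv would repeat the same parsing loop over the retired files; the `data/us/retired` path below is an assumption about the repository layout, so verify it against your local clone:

```python
import os

# Assumed location of former legislators in the cloned repository.
retired_path = "data/us/retired"

def list_yaml_files(folder):
    """Return sorted YAML filenames in a folder, or [] if it does not exist."""
    if not os.path.isdir(folder):
        return []
    return sorted(f for f in os.listdir(folder) if f.endswith(".yml"))

retired_files = list_yaml_files(retired_path)
# Running these files through the same two-pass loop as above, then
# df.to_csv("plural_legislators_retired_with_bioguide.csv", index=False),
# yields the historical dataset.
print(f"Found {len(retired_files)} retired-legislator files")
```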