Plural Policy Legislator Data Processing#
Overview#
This notebook processes legislator biographical data from the Plural Policy (OpenStates) repository to create comprehensive legislator datasets with bioguide_id mappings. This data serves as a critical reference for matching legislators across different data sources in the Bridge Grades methodology.
The notebook generates two key datasets:
- `plural_legislators_with_bioguide.csv`: Current legislators with bioguide_id mappings
- `plural_legislators_retired_with_bioguide.csv`: Historical legislators with bioguide_id mappings
These datasets enable accurate legislator identification when processing bill sponsorship data from Plural Policy sources.
Data Sources#
Input Files#
OpenStates People Repository - YAML files containing legislator biographical data
Repository: openstates/people.git
Data Format: Individual YAML files per legislator
Data Source Details#
Source: OpenStates People Repository
Congress: Historical and current legislators
Collection Date: August 8, 2025
Coverage: Comprehensive legislator database with multiple identifier schemes
Outputs#
Current Legislators Dataset#
File: plural_legislators_with_bioguide.csv
Columns:
- `id`: OpenStates unique identifier
- `name`: Full name of the legislator
- `given_name`: First name
- `family_name`: Last name
- `birth_date`: Date of birth
- `gender`: Gender information
- `email`: Contact email
- `image`: Profile image URL
- `party`: Political party affiliation
- `role_type`: upper (Senate) or lower (House)
- `district`: Congressional district
- `role_start_date`: Term start date
- `role_end_date`: Term end date
- `bioguide_id`: Congressional bioguide identifier (critical for Bridge Grades)
- `social_*`: Social media identifiers
- `*_id`: Various other identifier schemes
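Once exported, the file can be consumed like any CSV. The two-row inline CSV below is a stand-in so the snippet runs without the real file (the actual dataset has 32 columns); the column names match the list above, but the records are invented:

```python
import io
import pandas as pd

# Stand-in for plural_legislators_with_bioguide.csv (real file has 32 columns).
csv_text = (
    "id,name,party,role_type,district,bioguide_id\n"
    "ocd-person/aaa,Jane Doe,Independent,lower,1,D000000\n"
    "ocd-person/bbb,John Roe,Independent,upper,,R000001\n"
)
legislators = pd.read_csv(io.StringIO(csv_text))

# bioguide_id is the join key into the other Bridge Grades sources.
lookup = dict(zip(legislators["bioguide_id"], legislators["name"]))
print(lookup["D000000"])  # Jane Doe
```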
Retired Legislators Dataset#
File: plural_legislators_retired_with_bioguide.csv
Description: Historical legislators with the same structure as the current dataset, enabling matching of historical bill sponsorship data.
Technical Requirements#
Dependencies#
- `pandas`: Data manipulation and analysis
- `yaml`: YAML file parsing (PyYAML)
- `os`: File system operations
- `collections.defaultdict`: Rows whose missing fields default to None
Data Processing Notes#
YAML Parsing: Handles complex nested YAML structures
Identifier Extraction: Processes multiple identifier schemes per legislator
Data Normalization: Standardizes format across all legislator records
Missing Value Handling: Graceful handling of incomplete records
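These notes can be made concrete with a minimal normalization sketch for one record. The dict below stands in for the output of `yaml.safe_load` on a single legislator file; the field names mirror the OpenStates layout, but the record itself is invented for illustration:

```python
from collections import defaultdict

# Hypothetical record, standing in for yaml.safe_load() on one legislator file.
parsed = {
    "id": "ocd-person/00000000-0000-0000-0000-000000000000",
    "name": "Jane Doe",
    "party": [{"name": "Independent"}],
    "roles": [{"type": "lower", "district": "1", "start_date": "2023-01-03"}],
    "other_identifiers": [{"scheme": "bioguide", "identifier": "D000000"}],
}

def normalize(data):
    """Flatten one nested record into a flat row; absent fields stay None."""
    row = defaultdict(lambda: None)
    row["id"] = data.get("id")
    row["name"] = data.get("name")
    # Guard against empty lists as well as missing keys.
    row["party"] = (data.get("party") or [{}])[0].get("name")
    role = (data.get("roles") or [{}])[-1]
    row["role_type"] = role.get("type")
    row["district"] = role.get("district")
    row["role_start_date"] = role.get("start_date")
    row["role_end_date"] = role.get("end_date")
    for id_obj in data.get("other_identifiers", []):
        row[f"{id_obj.get('scheme')}_id"] = id_obj.get("identifier")
    return row

row = normalize(parsed)
print(row["bioguide_id"])    # D000000
print(row["role_end_date"])  # None -- missing value handled gracefully
```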
Data Processing Pipeline#
Step 1: Repository Data Collection#
Accesses OpenStates people repository
Identifies all YAML files containing legislator data
Processes each file individually
Step 2: YAML Processing#
Parses complex nested YAML structures
Extracts biographical and role information
Handles multiple identifier schemes per legislator
Step 3: Data Normalization#
Standardizes field names and formats
Handles missing values appropriately
Creates consistent data structure
Step 4: Output Generation#
Separates current and retired legislators
Exports clean datasets to CSV format
Validates bioguide_id completeness
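One way to realize Step 4's current/retired split is sketched below. Treating a missing role_end_date as "currently serving" is an illustrative assumption, not necessarily the notebook's exact criterion:

```python
import pandas as pd

# Toy rows; a missing role_end_date marks a sitting legislator (assumption).
df = pd.DataFrame({
    "name": ["A. Sitting", "B. Former"],
    "bioguide_id": ["S000001", "F000002"],
    "role_end_date": [None, "2019-01-03"],
})

current = df[df["role_end_date"].isna()]
retired = df[df["role_end_date"].notna()]

# Validate bioguide_id completeness before handing the frames downstream.
assert current["bioguide_id"].notna().all()
assert retired["bioguide_id"].notna().all()

# Each frame is then exported, e.g.
# current.to_csv("plural_legislators_with_bioguide.csv", index=False)
# retired.to_csv("plural_legislators_retired_with_bioguide.csv", index=False)
print(len(current), len(retired))  # 1 1
```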
Usage in Bridge Grades Pipeline#
This dataset serves as the legislator reference for bill sponsorship processing:
Source A-B Processing: Enables bioguide_id matching for bill sponsorship data from Plural Policy
Data Quality Assurance: Provides comprehensive legislator identification for validation
Historical Analysis: Supports analysis of historical collaboration patterns
Cross-Reference Validation: Ensures data consistency across different sources
Critical Role: Essential for accurate legislator identification when processing bill sponsorship data, as it provides the bioguide_id mappings required to link Plural Policy data with other Bridge Grades sources.
Notebook Walkthrough: Plural Policy Legislator Data Processing#
This notebook demonstrates the process of extracting and standardizing legislator data from the OpenStates repository to create comprehensive legislator datasets with bioguide_id mappings.
Key Steps:
Repository Access: Load and parse YAML files from OpenStates repository
Data Extraction: Extract biographical and role information from nested YAML structures
Identifier Processing: Handle multiple identifier schemes including bioguide_id
Data Standardization: Create consistent format across all legislator records
Output Generation: Export current and retired legislator datasets
Expected Runtime: 1-2 minutes
# Import required libraries
import pandas as pd
import os
# !pip install pyyaml  # uncomment if PyYAML is not installed
import yaml
from collections import defaultdict
Repository Access and File Discovery#
This section accesses the OpenStates people repository and identifies all YAML files containing legislator data. Each YAML file contains comprehensive biographical information for a single legislator.
Repository Structure#
Source: OpenStates people repository (openstates/people.git)
Format: Individual YAML files per legislator
Location: `data/us/legislature` directory
Coverage: Current and historical legislators
Note
Repository Setup: Ensure you have cloned the OpenStates people repository locally before running this notebook. The repository contains thousands of YAML files with legislator data.
# Clone the repository first: https://github.com/openstates/people.git
folder_path = "data/us/legislature"
files = [f for f in os.listdir(folder_path) if f.endswith(".yml")]

# Report how many legislator files were found
print(f"Found {len(files)} YAML files")
YAML Processing and Data Extraction#
This section processes the YAML files to extract legislator information and create a comprehensive dataset. The process handles complex nested YAML structures and multiple identifier schemes.
Processing Strategy#
Two-Pass Processing: First pass identifies all identifier schemes, second pass extracts data
Nested Structure Handling: Processes complex YAML hierarchies
Identifier Extraction: Handles multiple identifier schemes per legislator
Data Normalization: Creates consistent structure across all records
Key Data Fields#
Basic Information: Name, birth date, gender, contact information
Role Information: Chamber, district, term dates, party affiliation
Identifiers: bioguide_id, social media IDs, other identifier schemes
Metadata: Image URLs, email addresses, role history
Warning
Memory Considerations: Processing thousands of YAML files can be memory-intensive. The two-pass approach helps manage memory usage by first identifying all possible identifier schemes.
# Convert all the YAML files into a dataframe
rows = []

# First pass: gather all possible identifier schemes
all_schemes = set()
for file in files:
    with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
        data = yaml.safe_load(f)
    for id_obj in data.get("other_identifiers", []):
        all_schemes.add(id_obj.get("scheme"))

# Second pass: process each file with all identifier columns
for file in files:
    with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
        data = yaml.safe_load(f)

    # Prepare a row whose missing fields default to None
    row = defaultdict(lambda: None)

    # Basic biographical fields
    row["id"] = data.get("id")
    row["name"] = data.get("name")
    row["given_name"] = data.get("given_name")
    row["family_name"] = data.get("family_name")
    row["birth_date"] = data.get("birth_date")
    row["gender"] = data.get("gender")
    row["email"] = data.get("email")
    row["image"] = data.get("image")
    # Guard against an empty party list as well as a missing key
    row["party"] = (data.get("party") or [{}])[0].get("name")

    # Most recent role: the last entry in the roles list
    role = (data.get("roles") or [{}])[-1]
    row["role_type"] = role.get("type")
    row["district"] = role.get("district")
    row["role_start_date"] = role.get("start_date")
    row["role_end_date"] = role.get("end_date")

    # Social media identifiers (if available)
    for k, v in data.get("ids", {}).items():
        row[f"social_{k}"] = v

    # Other identifier schemes (bioguide, fec, govtrack, ...)
    for id_obj in data.get("other_identifiers", []):
        scheme = id_obj.get("scheme")
        identifier = id_obj.get("identifier")
        row[f"{scheme}_id"] = identifier

    rows.append(row)

# Create the DataFrame
df = pd.DataFrame(rows)

# Fill in missing columns for any scheme not found in every legislator
for scheme in all_schemes:
    col_name = f"{scheme}_id"
    if col_name not in df.columns:
        df[col_name] = None

df.columns
Index(['id', 'name', 'given_name', 'family_name', 'birth_date', 'gender',
'email', 'image', 'party', 'role_type', 'district', 'role_start_date',
'role_end_date', 'social_twitter', 'social_facebook', 'ballotpedia_id',
'bioguide_id', 'fec_id', 'google_entity_id_id', 'govtrack_id',
'house_history_id', 'icpsr_id', 'opensecrets_id', 'pictorial_id',
'thomas_id', 'votesmart_id', 'wikidata_id', 'wikipedia_id',
'social_youtube', 'cspan_id', 'maplight_id', 'lis_id'],
dtype='object')
Data Validation and Quality Assurance#
This section validates the processed data to ensure completeness and quality. We check for missing bioguide_id values and verify data integrity.
Validation Steps#
Missing Value Check: Identify records without bioguide_id
Data Completeness: Verify all expected fields are present
Identifier Validation: Ensure bioguide_id format consistency
Record Count Verification: Confirm expected number of legislators
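The checks above can be expressed in a few lines of pandas. The three-row frame is a toy stand-in for the processed table, and the bioguide_id format (one capital letter followed by six digits) is inferred from the sample IDs shown in this notebook:

```python
import pandas as pd

# Toy stand-in for the processed legislator table.
df = pd.DataFrame({
    "name": ["Kweisi Mfume", "Eric Schmitt", "Unknown Person"],
    "bioguide_id": ["M000687", "S001227", None],
})

# 1. Missing-value check: these rows cannot be linked to sponsorship data.
missing = df[df["bioguide_id"].isna()]

# 2. Format check: one capital letter followed by six digits.
known = df["bioguide_id"].dropna()
bad_format = known[~known.str.match(r"^[A-Z]\d{6}$")]

# 3. Record-count check against the expected number of legislators.
print(f"{len(missing)} missing, {len(bad_format)} malformed, {len(df)} total")
# 1 missing, 0 malformed, 3 total
```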
Note
bioguide_id Importance: The bioguide_id is the critical identifier that links this dataset with bill sponsorship data from Plural Policy. Records without bioguide_id cannot be used in the Bridge Grades pipeline.
df.head()
| | id | name | given_name | family_name | birth_date | gender | email | image | party | role_type | ... | opensecrets_id | pictorial_id | thomas_id | votesmart_id | wikidata_id | wikipedia_id | social_youtube | cspan_id | maplight_id | lis_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ocd-person/79575558-ef44-5bb5-9c64-3d3fe3fb4427 | Kweisi Mfume | Kweisi | Mfume | 1948-10-24 | Male | https://mfume.house.gov/address_authentication... | https://unitedstates.github.io/images/congress... | Democratic | lower | ... | N00001799 | 13090 | 00798 | 26892 | Q519504 | NaN | NaN | NaN | NaN | NaN |
| 1 | ocd-person/9db37a87-2ba9-56a0-9b42-89697222e044 | Carlos Giménez | Carlos | Giménez | 1954-01-17 | Male | https://gimenez.house.gov/contact | https://unitedstates.github.io/images/congress... | Republican | lower | ... | N00046394 | 13009 | NaN | 81366 | Q5041653 | Carlos A. Giménez | NaN | NaN | NaN | NaN |
| 2 | ocd-person/84a22f15-cf83-5f0b-a048-a6fc50aa60fe | Chris Smith | Chris | Smith | 1953-03-04 | Male | https://chrissmith.house.gov/contact/zipauth.htm | https://unitedstates.github.io/images/congress... | Republican | lower | ... | N00009816 | 13153 | 01071 | 26952 | Q981167 | Chris Smith (New Jersey politician) | UCtCNUDo3-I1gsd_03ppDfZg | 6411 | 469 | NaN |
| 3 | ocd-person/bac6c65d-846b-5e60-9532-fa216c99ccf6 | Mike Turner | Mike | Turner | 1960-01-11 | Male | https://turner.house.gov/email-me | https://unitedstates.github.io/images/congress... | Republican | lower | ... | N00025175 | 13204 | 01741 | 45519 | Q505722 | Mike Turner | UC-6Ss-aZ3OPisf9GdVGtN8g | 1003607 | 496 | NaN |
| 4 | ocd-person/999df51b-9318-55b9-b0f6-d738ffc1d62d | Eric Schmitt | Eric | Schmitt | 1975-06-20 | Male | https://www.schmitt.senate.gov/contact/ | https://unitedstates.github.io/images/congress... | Republican | upper | ... | N00048414 | 13416 | NaN | 104474 | Q5387455 | NaN | NaN | NaN | NaN | S420 |
5 rows × 32 columns
Dataset Export and Finalization#
This section exports the processed legislator data to CSV format for use in subsequent Bridge Grades processing steps.
Export Process#
Data Finalization: Ensure all data is properly formatted
CSV Export: Save dataset to CSV format
File Naming: Use descriptive filename for easy identification
Quality Check: Verify export completeness
Output Files#
- `plural_legislators_with_bioguide.csv`: Current legislators with bioguide_id mappings
- `plural_legislators_retired_with_bioguide.csv`: Historical legislators with bioguide_id mappings
Next Steps
These datasets will be used in the Source A-B processing notebook to enable accurate legislator identification when processing bill sponsorship data from Plural Policy.
df[["id", "name", "bioguide_id"]]
| | id | name | bioguide_id |
|---|---|---|---|
0 | ocd-person/79575558-ef44-5bb5-9c64-3d3fe3fb4427 | Kweisi Mfume | M000687 |
1 | ocd-person/9db37a87-2ba9-56a0-9b42-89697222e044 | Carlos Giménez | G000593 |
2 | ocd-person/84a22f15-cf83-5f0b-a048-a6fc50aa60fe | Chris Smith | S000522 |
3 | ocd-person/bac6c65d-846b-5e60-9532-fa216c99ccf6 | Mike Turner | T000463 |
4 | ocd-person/999df51b-9318-55b9-b0f6-d738ffc1d62d | Eric Schmitt | S001227 |
... | ... | ... | ... |
533 | ocd-person/86b65fd7-4549-51fa-80e8-eaf3daf3e60e | Lou Correa | C001110 |
534 | ocd-person/10de7024-4b40-57b0-ae78-101372fd4a02 | Pat Ryan | R000579 |
535 | ocd-person/39f36070-b860-5345-a3cc-ea7fdbf7dfb3 | Adrian Smith | S001172 |
536 | ocd-person/74d8d6c8-c349-4fe7-ae18-62c69c4f8d4b | Craig Goldman | G000601 |
537 | ocd-person/bb0d60f6-19a4-5229-a67d-a1d428e7c0b2 | Ben Cline | C001118 |
538 rows × 3 columns
# are there any missing bioguide ids?
df[df['bioguide_id'].isna()].head()
| id | name | given_name | family_name | birth_date | gender | email | image | party | role_type | ... | opensecrets_id | pictorial_id | thomas_id | votesmart_id | wikidata_id | wikipedia_id | social_youtube | cspan_id | maplight_id | lis_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 rows × 32 columns
# Save file to csv
df.to_csv("plural_legislators_with_bioguide.csv", index=False)
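The cell above writes only the current-legislator file. Producing plural_legislators_retired_with_bioguide.csv would repeat the same parsing loop over the retired files; the `data/us/retired` path below is an assumption about the repository layout, so verify it against your local clone:

```python
import os

# Assumed location of former legislators in the cloned repository.
retired_path = "data/us/retired"

def list_yaml_files(folder):
    """Return sorted YAML filenames in a folder, or [] if it does not exist."""
    if not os.path.isdir(folder):
        return []
    return sorted(f for f in os.listdir(folder) if f.endswith(".yml"))

retired_files = list_yaml_files(retired_path)
# Running these files through the same two-pass loop as above, then
# df.to_csv("plural_legislators_retired_with_bioguide.csv", index=False),
# yields the historical dataset.
print(f"Found {len(retired_files)} retired-legislator files")
```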