Plural Policy Legislator Data Processing#

Overview#

This notebook processes legislator biographical data from the Plural Policy (OpenStates) repository to create comprehensive legislator datasets with bioguide_id mappings. This data serves as a critical reference for matching legislators across different data sources in the Bridge Grades methodology.

The notebook generates two key datasets:

  • plural_legislators_with_bioguide.csv - Current legislators with bioguide_id mappings

  • plural_legislators_retired_with_bioguide.csv - Historical legislators with bioguide_id mappings

These datasets enable accurate legislator identification when processing bill sponsorship data from Plural Policy sources.

Data Sources#

Input Files#

  • OpenStates People Repository - YAML files containing legislator biographical data

  • Repository: openstates/people.git

  • Data Format: Individual YAML files per legislator

Data Source Details#

  • Source: OpenStates People Repository

  • Scope: Historical and current legislators

  • Collection Date: August 8, 2025

  • Coverage: Comprehensive legislator database with multiple identifier schemes


Outputs#

Current Legislators Dataset#

File: plural_legislators_with_bioguide.csv

Columns:

  • id: OpenStates unique identifier

  • name: Full name of the legislator

  • given_name: First name

  • family_name: Last name

  • birth_date: Date of birth

  • gender: Gender information

  • email: Contact email

  • image: Profile image URL

  • party: Political party affiliation

  • role_type: upper (Senate) or lower (House)

  • district: Congressional district

  • role_start_date: Term start date

  • role_end_date: Term end date

  • bioguide_id: Congressional bioguide identifier (critical for Bridge Grades)

  • social_*: Social media identifiers

  • *_id: Various other identifier schemes

Retired Legislators Dataset#

File: plural_legislators_retired_with_bioguide.csv

Description: Historical legislators with the same structure as the current dataset, enabling matching of historical bill sponsorship data.


Technical Requirements#

Dependencies#

  • pandas: Data manipulation and analysis

  • yaml: YAML file parsing

  • os: File system operations

  • collections.defaultdict: Default-valued rows so fields missing from a record resolve to None

Data Processing Notes#

  • YAML Parsing: Handles complex nested YAML structures

  • Identifier Extraction: Processes multiple identifier schemes per legislator

  • Data Normalization: Standardizes format across all legislator records

  • Missing Value Handling: Graceful handling of incomplete records
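
As a concrete illustration of the parsing and missing-value handling listed above, here is a minimal sketch of the defensive extraction pattern the notebook relies on; the sample record is invented for illustration.

import yaml

# Hypothetical, deliberately incomplete legislator record
sample = yaml.safe_load("""
name: Jane Doe
roles:
  - type: lower
    district: "3"
""")

# .get() with a fallback avoids KeyError on incomplete records, and
# "or [{}]" guards against keys that exist but hold empty lists
role = (sample.get("roles") or [{}])[-1]
print(sample.get("given_name"))  # None rather than an exception
print(role.get("district"))      # "3"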


Data Processing Pipeline#

Step 1: Repository Data Collection#

  • Accesses OpenStates people repository

  • Identifies all YAML files containing legislator data

  • Processes each file individually

Step 2: YAML Processing#

  • Parses complex nested YAML structures

  • Extracts biographical and role information

  • Handles multiple identifier schemes per legislator

Step 3: Data Normalization#

  • Standardizes field names and formats

  • Handles missing values appropriately

  • Creates consistent data structure

Step 4: Output Generation#

  • Separates current and retired legislators

  • Exports clean datasets to CSV format

  • Validates bioguide_id completeness


Usage in Bridge Grades Pipeline#

This dataset serves as the legislator reference for bill sponsorship processing:

  1. Source A-B Processing: Enables bioguide_id matching for bill sponsorship data from Plural Policy

  2. Data Quality Assurance: Provides comprehensive legislator identification for validation

  3. Historical Analysis: Supports analysis of historical collaboration patterns

  4. Cross-Reference Validation: Ensures data consistency across different sources

Critical Role: Essential for accurate legislator identification when processing bill sponsorship data, as it provides the bioguide_id mappings required to link Plural Policy data with other Bridge Grades sources.
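
To make this linking role concrete, the sketch below joins a bill sponsorship table to the legislator reference on bioguide_id. The sponsorships dataframe and its columns are hypothetical placeholders; only the legislator CSV comes from this notebook.

import pandas as pd

# Reference dataset produced by this notebook
legislators = pd.read_csv("plural_legislators_with_bioguide.csv")

# Hypothetical sponsorship records keyed by bioguide_id
sponsorships = pd.DataFrame({
    "bill_id": ["hr-1234", "s-567"],        # illustrative bill IDs
    "bioguide_id": ["M000687", "S001227"],  # IDs seen in the output below
})

# Attach legislator names and parties to each sponsorship record
merged = sponsorships.merge(
    legislators[["bioguide_id", "name", "party"]],
    on="bioguide_id",
    how="left",
)
print(merged)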

Notebook Walkthrough: Plural Policy Legislator Data Processing#

This notebook demonstrates the process of extracting and standardizing legislator data from the OpenStates repository to create comprehensive legislator datasets with bioguide_id mappings.

Key Steps:

  1. Repository Access: Load and parse YAML files from OpenStates repository

  2. Data Extraction: Extract biographical and role information from nested YAML structures

  3. Identifier Processing: Handle multiple identifier schemes including bioguide_id

  4. Data Standardization: Create consistent format across all legislator records

  5. Output Generation: Export current and retired legislator datasets

Expected Runtime: 1-2 minutes

# Import required libraries
import pandas as pd
import os
#!pip install pyyaml
import yaml
from collections import defaultdict

Repository Access and File Discovery#

This section accesses the OpenStates people repository and identifies all YAML files containing legislator data. Each YAML file contains comprehensive biographical information for a single legislator.

Repository Structure#

  • Source: OpenStates people repository (openstates/people.git)

  • Format: Individual YAML files per legislator

  • Location: data/us/legislature/ directory

  • Coverage: Current and historical legislators

Note

Repository Setup: Ensure you have cloned the OpenStates people repository locally before running this notebook. The repository contains thousands of YAML files with legislator data.
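
A minimal pre-flight check along these lines can catch a missing clone early. It assumes the notebook runs from a directory where the repository's data/ tree is reachable at the path used below.

import os

# Fail fast if the cloned repository is not where the notebook expects it
repo_dir = "data/us/legislature"
if not os.path.isdir(repo_dir):
    raise FileNotFoundError(
        "Expected OpenStates data at ./data/us/legislature. "
        "Run: git clone https://github.com/openstates/people.git"
    )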

# Clone the repository from GitHub first: https://github.com/openstates/people.git
folder_path = "data/us/legislature"
files = [f for f in os.listdir(folder_path) if f.endswith(".yml")]
# Report how many legislator files were discovered
print(f"Found {len(files)} YAML files")

YAML Processing and Data Extraction#

This section processes the YAML files to extract legislator information and create a comprehensive dataset. The process handles complex nested YAML structures and multiple identifier schemes.

Processing Strategy#

  1. Two-Pass Processing: First pass identifies all identifier schemes, second pass extracts data

  2. Nested Structure Handling: Processes complex YAML hierarchies

  3. Identifier Extraction: Handles multiple identifier schemes per legislator

  4. Data Normalization: Creates consistent structure across all records

Key Data Fields#

  • Basic Information: Name, birth date, gender, contact information

  • Role Information: Chamber, district, term dates, party affiliation

  • Identifiers: bioguide_id, social media IDs, other identifier schemes

  • Metadata: Image URLs, email addresses, role history
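
For orientation, a single parsed record looks roughly like the dictionary below once loaded with yaml.safe_load. The values are invented; the field names mirror those the extraction code reads (id, name, party, roles, ids, other_identifiers).

# Illustrative shape of one parsed legislator record (values are made up)
data = {
    "id": "ocd-person/00000000-0000-0000-0000-000000000000",
    "name": "Jane Doe",
    "given_name": "Jane",
    "family_name": "Doe",
    "party": [{"name": "Independent"}],
    "roles": [
        {"type": "lower", "district": "3",
         "start_date": "2023-01-03", "end_date": "2025-01-03"},
    ],
    "ids": {"twitter": "janedoe"},  # social media handles
    "other_identifiers": [
        {"scheme": "bioguide", "identifier": "D000000"},
    ],
}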

Warning

Memory Considerations: Processing thousands of YAML files can be memory-intensive. The two-pass approach helps manage memory usage by first identifying all possible identifier schemes.

# Convert all the yaml files to a dataframe
rows = []

# First pass: gather all possible identifier schemes
all_schemes = set()

for file in files:
    with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
        data = yaml.safe_load(f)
        for id_obj in data.get("other_identifiers", []):
            all_schemes.add(id_obj.get("scheme"))

# Now process each file with all identifier columns
for file in files:
    with open(os.path.join(folder_path, file), 'r', encoding='utf-8') as f:
        data = yaml.safe_load(f)

        # Prepare default row
        row = defaultdict(lambda: None)

        # Basic fields
        row["id"] = data.get("id")
        row["name"] = data.get("name")
        row["given_name"] = data.get("given_name")
        row["family_name"] = data.get("family_name")
        row["birth_date"] = data.get("birth_date")
        row["gender"] = data.get("gender")
        row["email"] = data.get("email")
        row["image"] = data.get("image")
        row["party"] = data.get("party", [{}])[0].get("name")

        # Most recent role; guard against a missing or empty roles list
        role = (data.get("roles") or [{}])[-1]
        row["role_type"] = role.get("type")
        row["district"] = role.get("district")
        row["role_start_date"] = role.get("start_date")
        row["role_end_date"] = role.get("end_date")

        # Social media (if available)
        for k, v in data.get("ids", {}).items():
            row[f"social_{k}"] = v

        # Other identifiers
        for id_obj in data.get("other_identifiers", []):
            scheme = id_obj.get("scheme")
            identifier = id_obj.get("identifier")
            row[f"{scheme}_id"] = identifier

        rows.append(row)

# Create DataFrame
df = pd.DataFrame(rows)

# Ensure every identifier scheme has a column, even if no record used it
for scheme in all_schemes:
    col_name = f"{scheme}_id"
    if col_name not in df.columns:
        df[col_name] = None
df.columns
Index(['id', 'name', 'given_name', 'family_name', 'birth_date', 'gender',
       'email', 'image', 'party', 'role_type', 'district', 'role_start_date',
       'role_end_date', 'social_twitter', 'social_facebook', 'ballotpedia_id',
       'bioguide_id', 'fec_id', 'google_entity_id_id', 'govtrack_id',
       'house_history_id', 'icpsr_id', 'opensecrets_id', 'pictorial_id',
       'thomas_id', 'votesmart_id', 'wikidata_id', 'wikipedia_id',
       'social_youtube', 'cspan_id', 'maplight_id', 'lis_id'],
      dtype='object')

Data Validation and Quality Assurance#

This section validates the processed data to ensure completeness and quality. We check for missing bioguide_id values and verify data integrity.

Validation Steps#

  1. Missing Value Check: Identify records without bioguide_id

  2. Data Completeness: Verify all expected fields are present

  3. Identifier Validation: Ensure bioguide_id format consistency

  4. Record Count Verification: Confirm expected number of legislators

Note

bioguide_id Importance: The bioguide_id is the critical identifier that links this dataset with bill sponsorship data from Plural Policy. Records without a bioguide_id cannot be used in the Bridge Grades pipeline.
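
The notebook's completeness check appears below; a format-consistency check along the lines of this sketch could complement it. The one-letter-plus-six-digits pattern matches the IDs visible in the outputs here, but treat the regex as an assumption rather than an official specification.

# Flag bioguide_ids that deviate from the expected letter-plus-six-digits shape
# (na=True leaves missing IDs to the separate completeness check)
bad_format = df[~df["bioguide_id"].str.match(r"^[A-Z]\d{6}$", na=True)]
print(f"{len(bad_format)} records with an unexpected bioguide_id format")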

df.head()
id name given_name family_name birth_date gender email image party role_type ... opensecrets_id pictorial_id thomas_id votesmart_id wikidata_id wikipedia_id social_youtube cspan_id maplight_id lis_id
0 ocd-person/79575558-ef44-5bb5-9c64-3d3fe3fb4427 Kweisi Mfume Kweisi Mfume 1948-10-24 Male https://mfume.house.gov/address_authentication... https://unitedstates.github.io/images/congress... Democratic lower ... N00001799 13090 00798 26892 Q519504 NaN NaN NaN NaN NaN
1 ocd-person/9db37a87-2ba9-56a0-9b42-89697222e044 Carlos Giménez Carlos Giménez 1954-01-17 Male https://gimenez.house.gov/contact https://unitedstates.github.io/images/congress... Republican lower ... N00046394 13009 NaN 81366 Q5041653 Carlos A. Giménez NaN NaN NaN NaN
2 ocd-person/84a22f15-cf83-5f0b-a048-a6fc50aa60fe Chris Smith Chris Smith 1953-03-04 Male https://chrissmith.house.gov/contact/zipauth.htm https://unitedstates.github.io/images/congress... Republican lower ... N00009816 13153 01071 26952 Q981167 Chris Smith (New Jersey politician) UCtCNUDo3-I1gsd_03ppDfZg 6411 469 NaN
3 ocd-person/bac6c65d-846b-5e60-9532-fa216c99ccf6 Mike Turner Mike Turner 1960-01-11 Male https://turner.house.gov/email-me https://unitedstates.github.io/images/congress... Republican lower ... N00025175 13204 01741 45519 Q505722 Mike Turner UC-6Ss-aZ3OPisf9GdVGtN8g 1003607 496 NaN
4 ocd-person/999df51b-9318-55b9-b0f6-d738ffc1d62d Eric Schmitt Eric Schmitt 1975-06-20 Male https://www.schmitt.senate.gov/contact/ https://unitedstates.github.io/images/congress... Republican upper ... N00048414 13416 NaN 104474 Q5387455 NaN NaN NaN NaN S420

5 rows × 32 columns

Dataset Export and Finalization#

This section exports the processed legislator data to CSV format for use in subsequent Bridge Grades processing steps.

Export Process#

  1. Data Finalization: Ensure all data is properly formatted

  2. CSV Export: Save dataset to CSV format

  3. File Naming: Use descriptive filename for easy identification

  4. Quality Check: Verify export completeness

Output Files#

  • plural_legislators_with_bioguide.csv: Current legislators with bioguide_id mappings

  • plural_legislators_retired_with_bioguide.csv: Historical legislators with bioguide_id mappings

Next Steps

These datasets will be used in the Source A-B processing notebook to enable accurate legislator identification when processing bill sponsorship data from Plural Policy.

df[["id", "name", "bioguide_id"]]
id name bioguide_id
0 ocd-person/79575558-ef44-5bb5-9c64-3d3fe3fb4427 Kweisi Mfume M000687
1 ocd-person/9db37a87-2ba9-56a0-9b42-89697222e044 Carlos Giménez G000593
2 ocd-person/84a22f15-cf83-5f0b-a048-a6fc50aa60fe Chris Smith S000522
3 ocd-person/bac6c65d-846b-5e60-9532-fa216c99ccf6 Mike Turner T000463
4 ocd-person/999df51b-9318-55b9-b0f6-d738ffc1d62d Eric Schmitt S001227
... ... ... ...
533 ocd-person/86b65fd7-4549-51fa-80e8-eaf3daf3e60e Lou Correa C001110
534 ocd-person/10de7024-4b40-57b0-ae78-101372fd4a02 Pat Ryan R000579
535 ocd-person/39f36070-b860-5345-a3cc-ea7fdbf7dfb3 Adrian Smith S001172
536 ocd-person/74d8d6c8-c349-4fe7-ae18-62c69c4f8d4b Craig Goldman G000601
537 ocd-person/bb0d60f6-19a4-5229-a67d-a1d428e7c0b2 Ben Cline C001118

538 rows × 3 columns

# are there any missing bioguide ids?
df[df['bioguide_id'].isna()].head()
id name given_name family_name birth_date gender email image party role_type ... opensecrets_id pictorial_id thomas_id votesmart_id wikidata_id wikipedia_id social_youtube cspan_id maplight_id lis_id

0 rows × 32 columns

# Save the current legislators dataset to CSV
df.to_csv("plural_legislators_with_bioguide.csv", index=False)
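
The cell above writes the current-legislators file. The retired dataset can be produced the same way. The sketch below assumes retired members live under data/us/retired in the cloned repository (an assumption based on the repository's layout; verify it in your clone) and condenses the extraction loop above into a small helper, build_rows, shown here with only the core identifier fields.

# Sketch: build the retired dataset with the same extraction logic.
# The data/us/retired path is an assumption; verify it in your clone.
retired_path = "data/us/retired"

def build_rows(path):
    """Parse every legislator YAML file in `path`, mirroring the loop above."""
    out = []
    for fname in os.listdir(path):
        if not fname.endswith(".yml"):
            continue
        with open(os.path.join(path, fname), "r", encoding="utf-8") as f:
            data = yaml.safe_load(f)
        row = defaultdict(lambda: None)
        row["id"] = data.get("id")
        row["name"] = data.get("name")
        row["given_name"] = data.get("given_name")
        row["family_name"] = data.get("family_name")
        for id_obj in data.get("other_identifiers", []):
            row[f"{id_obj.get('scheme')}_id"] = id_obj.get("identifier")
        out.append(row)
    return out

df_retired = pd.DataFrame(build_rows(retired_path))
df_retired.to_csv("plural_legislators_retired_with_bioguide.csv", index=False)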