OpenRefine Part 1: Installing and merging datasets
If you find yourself getting bogged down when you’re working with large datasets in Excel or Google Sheets, OpenRefine might be the solution you’ve been looking for.
In this tutorial, we’ll walk you through downloading OpenRefine and how to merge multiple datasets.
Watch the video
What is OpenRefine?
OpenRefine is a powerful, free tool designed to help data professionals clean, transform and merge complex datasets that traditional spreadsheet software cannot handle. Whether you’re working with research data, performance reports or any other large dataset, OpenRefine can simplify your data management process.
Download OpenRefine
- Open your web browser. Search for ‘OpenRefine’ and navigate to the official download page.
- Select the version that matches your operating system (Windows, Mac, or Linux) and download the software.
- Once installed, clicking OpenRefine will open it in a new tab in your browser.
Prepare your datasets for merging
You need to do a quick recon of your original datasets. Open them in a spreadsheet program like Excel or GoogleSheets and make sure that:
- The column headers are consistent: Your spreadsheets must have exactly the same column headings.
- Matching column order: The columns must be in the exact same sequence for OpenRefine to merge the datasets correctly.
Pro tip: Preparing your files
- Download each dataset as a CSV (comma separated values) file. If working with multiple files, compress them into a single zip file.
- Select all the downloaded CSV files. Right-click and choose Compress to create a zip archive.
- Rename the zip file to something appropriately descriptive.
Importing the data
- Go back to OpenRefine in your browser.
- Click Choose File and upload your zipped dataset. Click the Next button.
- Review the file recognition settings: You’ll need to make sure that OpenRefine correctly identifies the file format as CSV. If not, you can manually select the correct format from the dropdown menu.
- Ensure that ‘Ignore first row’ option is unchecked if your column headings are in the first row. Otherwise, selected the appropriate number of rows to ignore.
- Click Create Project.
Viewing and verifying the merged data
OpenRefine initially displays only 10 rows. You can adjust this in the dropdown menu – choosing up to 1,000 rows to see a substantial portion of your data.
Scroll through the data to verify that the files have been merged correctly and that all the original files are represented in the merged dataset.
The full series
Part 1: Installing OpenRefine and merging datasets | WATCH
Part 2: Removing duplicates and tracking changes | WATCH
Part 3: Fixing inconsistencies with the cluster and edit tool | WATCH
Notebook
- OpenRefine’s official website and documentation
- Read more: 5 reasons to switch to OpenRefine to clean data
- Subscribe to The Outlier’s YouTube channel to be notified of new updates
- Sign up to The Outlier’s newsletter for weekly tools, tips and data-driven insights