Use the popular Pandas library for data manipulation and analysis to read data from two files and join them into a single dataset. Credit: Thinkstock In December 2019 my InfoWorld colleague Sharon Machlis wrote an article called “How to merge data in R using R merge, dplyr, or data.table.” Sharon is a whiz at R programming, and analytics in general, which comes through in the article. But suppose you’re planning on doing machine learning or deep learning on the data using Python and (for example) Scikit-learn, PyTorch, or TensorFlow? While it’s possible to pass data from R to Python, it’s not the best solution to the problem. In this article, I’ll discuss how to accomplish data merging natively in Python, which will make it easy to pass the in-memory merged dataset on to one of the Python AI frameworks. I’ll use the same public datasets as Sharon did, which record US airline flight delays, but I’ll stick to the best framework for the purpose, which is currently Pandas. [ Tune into Serdar Yegulalp’s Smart Python video tutorials to learn smart Python tricks in 5 minutes or less ] Pandas and the DataFrame Pandas is a BSD-licensed open source library that provides high-performance, easy-to-use data structures and data analysis tools for Python. The R dataframe data structure has a close equivalent in Pandas, the DataFrame. As the Pandas project describes it: DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used Pandas object. Pandas merge The Pandas method for joining two DataFrame objects is merge(), which is the single entry point for all standard database join operations between DataFrame or named Series objects. (Series objects are one-dimensional.) The merge method specification is: pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=True, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None) left : A DataFrame or named Series object. right : Another DataFrame or named Series object. on : Column or index level names to join on. Must be found in both the left and right DataFrame and/or Series objects. If not passed and left_index and right_index are False, the intersection of the columns in the DataFrames and/or Series will be inferred to be the join keys. left_on : Columns or index levels from the left DataFrame or Series to use as keys. Can either be column names, index level names, or arrays with length equal to the length of the DataFrame or Series. right_on : Columns or index levels from the right DataFrame or Series to use as keys. Can either be column names, index level names, or arrays with length equal to the length of the DataFrame or Series. left_index : If True, use the index (row labels) from the left DataFrame or Series as its join key(s). In the case of a DataFrame or Series with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame or Series. right_index : Same usage as left_index for the right DataFrame or Series how : One of left, right, ‘outer, or inner. Defaults to inner. See below for more detailed description of each method. sort : Sort the result DataFrame by the join keys in lexicographical order. Defaults to True; setting to False will improve performance substantially in many cases. suffixes : A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x', '_y'). copy : Always copy data (default True) from the passed DataFrame or named Series objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance and memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless. indicator : Add a column to the output DataFrame called _merge with information on the source of each row. _merge is Categorical-type and takes on a value of left_only for observations whose merge key only appears in left DataFrame or Series, right_only for observations whose merge key only appears in right DataFrame or Series, and both if the observation’s merge key is found in both. validate : string, default None. If specified, checks if merge is of specified type. one_to_one or 1:1 : checks if merge keys are unique in both left and right datasets. one_to_many or 1:m : checks if merge keys are unique in left dataset. many_to_one or m:1 : checks if merge keys are unique in right dataset. many_to_many or m:m : allowed, but does not result in checks. Read a CSV file in Pandas As you might expect, Pandas has a method for reading CSV files, pd.read_csv(), which returns a DataFrame. It has many optional arguments, but for our purposes only the basic arguments are required: filepath_or_buffer : various Either a path to a file (a str, pathlib.Path, or py._path.local.LocalPath), URL (including HTTP, FTP, and Amazon S3 locations), or any object with a read() method (such as an open file or StringIO). Sep : str, defaults to ',' for read_csv(), t for read_table() Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s built-in sniffer tool, csv.Sniffer. In addition, separators longer than one character and different from 's+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: 'rt'. delim_whitespace : boolean, default False Specifies whether or not whitespace (e.g. ' ' or 't') will be used as the delimiter. Equivalent to setting sep='s+'. If this option is set to True, nothing should be passed in for the delimiter parameter. US Flight Delays dataset Sharon used two tables from the US Bureau of Transportation Statistics Flight Delays dataset, the actual flight delays data and the lookup table for the reporting airline. To make this comparable to Sharon’s analysis, let’s download her files, found here. The zip file you’ll get also includes Sharon’s R code and a PowerPoint presentation. Go ahead and unzip the file rmerges3ways.zip. Then unzip the file 673598238_T_ONTIME_REPORTING.zip to yield 673598238_T_ONTIME_REPORTING.csv. We’ll use that CSV file along with L_UNIQUE_CARRIERS.csv_. Note the trailing underscore in the latter filename. The file is actually in a valid comma-separated variables format; the USBoTS site adds the underscore when it generates lookup tables. Looking at the first 20 lines of the two CSV files in a text editor (below), we see that both have header rows and do use commas as separators. That implies that the defaults of read_csv() will be fine; we just need to specify the file paths. Delay data: “FL_DATE”,”OP_UNIQUE_CARRIER”,”ORIGIN”,”DEST”,”DEP_DELAY_NEW”, 2019-08-01,”DL”,”ATL”,”DFW”,31.00, 2019-08-01,”DL”,”DFW”,”ATL”,0.00, 2019-08-01,”DL”,”IAH”,”ATL”,40.00, 2019-08-01,”DL”,”PDX”,”SLC”,0.00, 2019-08-01,”DL”,”SLC”,”PDX”,0.00, 2019-08-01,”DL”,”DTW”,”ATL”,10.00, 2019-08-01,”DL”,”ATL”,”DTW”,0.00, 2019-08-01,”DL”,”MSP”,”JFK”,22.00, 2019-08-01,”DL”,”JFK”,”MSP”,0.00, 2019-08-01,”DL”,”BHM”,”ATL”,0.00, 2019-08-01,”DL”,”ATL”,”BHM”,0.00, 2019-08-01,”DL”,”ATL”,”BOS”,17.00, 2019-08-01,”DL”,”BOS”,”ATL”,5.00, 2019-08-01,”DL”,”SEA”,”LAX”,2.00, 2019-08-01,”DL”,”LAX”,”SEA”,0.00, 2019-08-01,”DL”,”GSP”,”ATL”,0.00, 2019-08-01,”DL”,”ATL”,”GSP”,8.00, 2019-08-01,”DL”,”MSP”,”BOS”,0.00, 2019-08-01,”DL”,”BOS”,”MSP”,51.00, Lookup table: Code,Description “02Q”,”Titan Airways” “04Q”,”Tradewind Aviation” “05Q”,”Comlux Aviation, AG” “06Q”,”Master Top Linhas Aereas Ltd.” “07Q”,”Flair Airlines Ltd.” “09Q”,”Swift Air, LLC d/b/a Eastern Air Lines d/b/a Eastern” “0BQ”,”DCA” “0CQ”,”ACM AIR CHARTER GmbH” “0FQ”,”Maine Aviation Aircraft Charter, LLC” “0GQ”,”Inter Island Airways, d/b/a Inter Island Air” “0HQ”,”Polar Airlines de Mexico d/b/a Nova Air” “0J”,”JetClub AG” “0JQ”,”Vision Airlines” “0LQ”,”Metropix UK, LLP.” “0MQ”,”Multi-Aero, Inc. d/b/a Air Choice One” “0OQ”,”Open Skies” “0Q”,”Flying Service N.V.” “0QQ”,”TAG Aviation (UK) Ltd.” “0RQ”,”TAG Aviation Espana S.L.” Install Pandas and its dependencies Unless you want to check out Pandas from GitHub and compile it yourself, you can install Pandas via the Pip package installer: pip install pandas Alternatively, if you are running the Anaconda Python distribution, you can install Pandas via Conda: conda install pandas Because I have both Python 2.7 and Python 3.7 installed and I want to use the latter, I used pip3 install pandas, and discovered that I already had Pandas on my machine. I did upgrade it, though, as the version I had was kind of old. martinheller@Martins-Retina-MacBook ~ % pip3 install pandas Requirement already satisfied: pandas in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (0.24.2) Requirement already satisfied: pytz>=2011k in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (2019.1) Requirement already satisfied: python-dateutil>=2.5.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (2.7.5) Requirement already satisfied: numpy>=1.12.0 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (1.16.3) Requirement already satisfied: six>=1.5 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from python-dateutil>=2.5.0->pandas) (1.11.0) martinheller@Martins-Retina-MacBook ~ % pip3 install --upgrade pandas Collecting pandas Downloading pandas-1.1.4-cp37-cp37m-macosx_10_9_x86_64.whl (9.9 MB) |████████████████████████████████| 9.9 MB 4.2 MB/s Requirement already satisfied, skipping upgrade: python-dateutil>=2.7.3 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (2.7.5) Requirement already satisfied, skipping upgrade: pytz>=2017.2 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (2019.1) Requirement already satisfied, skipping upgrade: numpy>=1.15.4 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from pandas) (1.16.3) Requirement already satisfied, skipping upgrade: six>=1.5 in /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas) (1.11.0) Installing collected packages: pandas Attempting uninstall: pandas Found existing installation: pandas 0.24.2 Uninstalling pandas-0.24.2: Successfully uninstalled pandas-0.24.2 Successfully installed pandas-1.1.4 Python code to merge data At this point, the code pretty much writes itself. import pandas as pd mydf = pd.read_csv("673598238_T_ONTIME_REPORTING.csv") mylookup = pd.read_csv("L_UNIQUE_CARRIERS.csv_") mydf.head(10) mylookup.head(10) joined_df = pd.merge(mydf, mylookup, left_on= "OP_UNIQUE_CARRIER", right_on="Code") joined_df.head(10) If we run these lines in the Python 3 console, we get: martinheller@Martins-Retina-MacBook rmerges3ways % python3 Python 3.7.2 (v3.7.2:9a3ffc0492, Dec 24 2018, 02:44:43) [Clang 6.0 (clang-600.0.57)] on darwin Type "help", "copyright", "credits" or "license" for more information. >>> import pandas as pd >>> mydf = pd.read_csv("673598238_T_ONTIME_REPORTING.csv") >>> mydf.head(10) FL_DATE OP_UNIQUE_CARRIER ORIGIN DEST DEP_DELAY_NEW Unnamed: 5 0 2019-08-01 DL ATL DFW 31.0 NaN 1 2019-08-01 DL DFW ATL 0.0 NaN 2 2019-08-01 DL IAH ATL 40.0 NaN 3 2019-08-01 DL PDX SLC 0.0 NaN 4 2019-08-01 DL SLC PDX 0.0 NaN 5 2019-08-01 DL DTW ATL 10.0 NaN 6 2019-08-01 DL ATL DTW 0.0 NaN 7 2019-08-01 DL MSP JFK 22.0 NaN 8 2019-08-01 DL JFK MSP 0.0 NaN 9 2019-08-01 DL BHM ATL 0.0 NaN >>> mylookup = pd.read_csv("L_UNIQUE_CARRIERS.csv_") >>> mylookup.head(10) Code Description 0 02Q Titan Airways 1 04Q Tradewind Aviation 2 05Q Comlux Aviation, AG 3 06Q Master Top Linhas Aereas Ltd. 4 07Q Flair Airlines Ltd. 5 09Q Swift Air, LLC d/b/a Eastern Air Lines d/b/a E... 6 0BQ DCA 7 0CQ ACM AIR CHARTER GmbH 8 0FQ Maine Aviation Aircraft Charter, LLC 9 0GQ Inter Island Airways, d/b/a Inter Island Air >>> joined_df = pd.merge(mydf, mylookup, left_on= "OP_UNIQUE_CARRIER", right_on="Code") >>> joined_df.head(10) FL_DATE OP_UNIQUE_CARRIER ORIGIN ... Unnamed: 5 Code Description 0 2019-08-01 DL ATL ... NaN DL Delta Air Lines Inc. 1 2019-08-01 DL DFW ... NaN DL Delta Air Lines Inc. 2 2019-08-01 DL IAH ... NaN DL Delta Air Lines Inc. 3 2019-08-01 DL PDX ... NaN DL Delta Air Lines Inc. 4 2019-08-01 DL SLC ... NaN DL Delta Air Lines Inc. 5 2019-08-01 DL DTW ... NaN DL Delta Air Lines Inc. 6 2019-08-01 DL ATL ... NaN DL Delta Air Lines Inc. 7 2019-08-01 DL MSP ... NaN DL Delta Air Lines Inc. 8 2019-08-01 DL JFK ... NaN DL Delta Air Lines Inc. 9 2019-08-01 DL BHM ... NaN DL Delta Air Lines Inc. [10 rows x 8 columns] And there we have it. Or do we? If you run the code as a file from the command line, instead of line by line in the interactive Python console, there won’t be any output. We need to add some print() function calls: import pandas as pd mydf = pd.read_csv("673598238_T_ONTIME_REPORTING.csv") mylookup = pd.read_csv("L_UNIQUE_CARRIERS.csv_") print( mydf.head(10) ) print( mylookup.head(10) ) joined_df = pd.merge(mydf, mylookup, left_on= "OP_UNIQUE_CARRIER", right_on="Code") print( joined_df.head(10) ) Now python3 pymerge.py will emit the output we want. PythonData ScienceAnalytics