How To Create a pandas DataFrame From an API Endpoint That Lacks Order

A pandas DataFrame is a two-dimensional data structure that organizes data in rows and columns, much like an Excel spreadsheet or a SQL table. DataFrames are widely used in data-intensive fields such as scientific research, data science, machine learning, and data analysis.

With Python, you can convert data from CSV files, Excel files, or API endpoints into a pandas DataFrame and then manipulate it. Converting Excel and CSV files is usually straightforward, but creating DataFrames from API endpoints can be challenging because endpoints vary widely in how they structure their data.

Some API endpoints return data in a convenient dictionary format, with keys representing column names and values representing the corresponding elements. Others, however, provide little context about which values are column names and which are data. Converting such unordered API endpoints into a DataFrame can be difficult.

In this tutorial, you will learn how to gather data from a disorganized endpoint and shape it into a format that pandas can consume. First, you will retrieve data from a few common API formats to see how structured API data converts into a DataFrame. Then you will handle an unstructured API endpoint: you will collect the column names and their corresponding values and assemble them into a dictionary. Finally, you will use that dictionary to create a pandas DataFrame suitable for data analysis.

Requirements

To follow this tutorial, you will need:

  • Python 3, pip, and the Python venv module installed on your machine. For Ubuntu and other Linux distributions, follow Steps 1 and 2 of our tutorial on How To Install Python 3 and Set Up a Programming Environment. For macOS, follow Steps 1–5 of our tutorial on How To Install Python 3 and Set Up a Local Programming Environment on macOS.
  • Jupyter notebook installed and running on your machine. For remote servers running Ubuntu/Debian, follow How to Install, Run, and Connect to Jupyter Notebook on a Remote Server. For other operating systems, check the official documentation for installing Jupyter.
  • A modern web browser that you will use to access Jupyter Notebook.
  • Familiarity with the fundamentals of Python, such as for loops, functions, and data structures (in particular, lists and dictionaries).
  • Experience with the Python pandas module for data manipulation and analysis.

Step 1: Preparing the Jupyter Notebook Environment

You will be using Jupyter Notebook for this tutorial in order to work with the data. Jupyter Notebook is beneficial for iterative coding as it enables you to write a short piece of code, execute it, and obtain the output.

To follow this tutorial in Jupyter Notebook, you will create a new notebook and install the necessary dependencies, as described in this step.

Following the prerequisites, you should already have a Jupyter Notebook environment available on your server. Once logged in to your server, activate the virtual environment:

source ~/environments/my_env/bin/activate

Then start the Jupyter Notebook application:

jupyter notebook

Note

If you are following this tutorial on a remote server, you can access Jupyter Notebook in your local machine's browser using port forwarding. Open a new terminal and enter the command ssh -L 8888:localhost:8888 your_non_root_user@your_server_ip. After connecting to the server, follow the link in the output to access your Jupyter Notebook. Keep this terminal open for the remainder of the tutorial.

Once you have run the command and connected, you can navigate the user interface in your web browser. From the New dropdown menu, choose the Python3(ipykernel) option, which opens a new tab containing an untitled Python notebook. Name the file; this tutorial uses convert_data.ipynb.

Next, in the first cell of your notebook, use pip to install the required dependencies:

convert_data.ipynb
pip install pandas
pip install requests

The pandas module gives you access to the DataFrame data structure, while the requests module lets you retrieve data from a URL.

Once you have included the commands in your Jupyter Notebook, click on the Run button to execute them.

Your Jupyter Notebook will display a continuous output to show that the dependencies are being downloaded. You will find a new input cell below this output where you can execute the subsequent lines of code.

Next, run the following commands to import the modules you just installed:

convert_data.ipynb
import pandas as pd
import requests

The pd alias lets you refer to the pandas module by typing pd instead of writing out pandas each time.

With this stage complete, your Jupyter Notebook environment is set up. Next, you will perform some basic conversions to create DataFrames from dictionaries.

Step 2: Converting Ordered API Data Into a pandas DataFrame

Before fetching data from an API endpoint that lacks order, you will first generate DataFrames from API data in common, ordered formats, including data available at a URL.

In many API endpoints, the keys indicate the column names, and the values linked to those keys hold the elements belonging to each column. However, some API endpoints do not follow this format. In those cases, you need a technique to extract the column names and their corresponding data in order to build a complete dictionary. Familiarity with the established approaches to converting conventional API endpoints into pandas DataFrames will help you accomplish this.

In this step, you will use several conventional techniques to create DataFrames from dictionaries in different formats.

Method 1: Creating a DataFrame From a List of Dictionaries

In this approach, every dictionary inside the list corresponds to a single row of data in the resulting DataFrame. The dictionary keys provide the column names, and their values become the elements within each column.

To implement this approach, you will create a DataFrame with Age, ID, and Name columns holding the details of three individuals.

Run the following code in a new cell to define a list of dictionaries, assigned to the variable dict1 (you can change the name if you prefer):

convert_data.ipynb
dict1=[{"Age":41,"ID":1234,"Name":"Bella"},{"Age":24,"ID":5678,"Name":"Mark"},{"Age":26,"ID":2341,"Name":"John"}]

data=pd.DataFrame(dict1)

data.head()

You assign a list of dictionaries to the variable dict1. Each dictionary in the list corresponds to a row in the resulting DataFrame. The keys Age, ID, and Name become the column names, and the value assigned to each key is that column's value for the respective row. Finally, you pass this list to pandas to generate the DataFrame.

The following DataFrame will be displayed:

Output

   Age    ID   Name
0   41  1234  Bella
1   24  5678   Mark
2   26  2341   John

The DataFrame was generated using a list of dictionaries. The column names in the DataFrame correspond to the keys you specified in each dictionary of the dict1 list, and the row-wise data is represented by the values in these dictionaries.

Method 2: Creating a DataFrame From a Single Dictionary

In the previous method, you defined the data row by row, which meant writing multiple dictionaries within a list. Each dictionary represented one row, with column names as keys and the row's elements as values.

This time, you will specify the data column by column using a single dictionary. The dictionary's keys are the column names, and each value is a list containing that column's data.

To try this approach, you can use the same data as in the previous section. Run the following commands in a new cell:

convert_data.ipynb
dict2={"Age":[41,24,26],"ID":[1234,5678,2341],"Name":["Bella","Mark","John"]}

data=pd.DataFrame(dict2)

data.head()

In this case, column-wise data is specified using a dictionary. The dict2 dictionary contains keys that correspond to the column names: Age, ID, and Name. Each key in the dictionary holds a list that represents the data for the respective column.

The following DataFrame will be displayed:

Output

   Age    ID   Name
0   41  1234  Bella
1   24  5678   Mark
2   26  2341   John

Each column in the dict2 dictionary is represented by its respective key (Age, ID, and Name), and each column contains the list you specified as its value. For instance, the Age column contains the numbers 41, 24, and 26.

As anticipated, both techniques yield identical outcomes. While the initial two techniques produce the same outputs, you have the option to construct a DataFrame using either a row-wise or column-wise approach.
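If you want to verify this equivalence yourself, pandas provides the equals() method, which compares two DataFrames by values, column names, and index. This is a minimal sketch, assuming dict1 and dict2 are defined as in the cells above:

df_rows = pd.DataFrame(dict1)   # row-wise construction (list of dictionaries)
df_cols = pd.DataFrame(dict2)   # column-wise construction (dictionary of lists)

# equals() returns True only if values, columns, and index all match
print(df_rows.equals(df_cols))  # True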

Method 3: Creating a DataFrame From API Data at a URL

In this method, you will generate a DataFrame from API data retrieved from a URL. At API-based URLs, data is typically presented as a list of dictionaries, similar to the data used in Method 1. However, instead of entering the data manually, you will use the requests module to fetch it from the URL.

To implement this approach, you will use the Countries with Capitals JSON API, which presents countries and their respective capitals in a structured way. To view the API data, open the following URL in your web browser:

https://raw.githubusercontent.com/dariusk/corpora/master/data/geography/countries_with_capitals.json

This is the beginning of the API data served at that URL:

Countries and their capitals (API data from the Countries with Capitals URL)

{
  "description": "A list of countries and its respective capitals.",
  "countries": [
    {"name":"Afghanistan", "capital":"Kabul"},
    {"name":"Albania", "capital":"Tirana"},
    {"name":"Algeria", "capital":"Algiers"},
    {"name":"Andorra", "capital":"Andorra la Vella"},
    {"name":"Angola", "capital":"Luanda"},
    {"name":"Antigua & Barbuda", "capital":"St. John's"},
    {"name":"Argentina", "capital":"Buenos Aires"},
    {"name":"Armenia", "capital":"Yerevan"},
    {"name":"Australia", "capital":"Canberra"},
    {"name":"Austria", "capital":"Vienna"},
...

The API is displayed in a familiar structure: the main dictionary includes a list of dictionaries under the “countries” key. Each inner dictionary represents a row of data, with “name” and “capital” as the column names and their respective values.

To generate a pandas DataFrame from this URL, run the following code in a new cell:

convert_data.ipynb
response=requests.get('https://raw.githubusercontent.com/dariusk/corpora/master/data/geography/countries_with_capitals.json').json()

df1=pd.DataFrame(response['countries'])

df1.head()

The requests module retrieves the data from the URL. Here, the response is parsed as JSON and stored in the variable response. The list of dictionaries found under the countries key is then passed to pandas to create a DataFrame, which is assigned to df1.

Once you run the code, the following DataFrame is displayed:

Output

          name           capital
0  Afghanistan             Kabul
1      Albania            Tirana
2      Algeria           Algiers
3      Andorra  Andorra la Vella
4       Angola            Luanda

This DataFrame was produced by passing the list of countries and capitals to pandas. The column names, name and capital, are extracted from the keys of each dictionary in the list, and the dictionaries' values supply the row data, since each dictionary corresponds to one row.
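As an optional refinement, you may want to guard against failed requests before parsing, and recent pandas versions also offer json_normalize() as a shortcut for flattening a list of dictionaries. This is a sketch under the assumption that the endpoint is reachable and your pandas version includes pd.json_normalize:

resp = requests.get('https://raw.githubusercontent.com/dariusk/corpora/master/data/geography/countries_with_capitals.json')
resp.raise_for_status()   # raise an exception on a 4xx/5xx response instead of parsing bad data

# json_normalize builds a DataFrame from a list of flat dictionaries in one call
df_alt = pd.json_normalize(resp.json()['countries'])
df_alt.head()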

So far, you have created DataFrames from simple dictionaries and from structured API endpoints. URLs hosting structured APIs typically present their data as in Method 1: a list of dictionaries, each representing one row. When API data is not organized that way, you can still collect the column names and their data row-wise or column-wise, as in Method 1 or Method 2. In the next step, you will combine several of these tactics to retrieve data from an unstructured API endpoint.

Step 3: Converting Unordered API Data Into a pandas DataFrame

So far, you have converted ordered data into pandas DataFrames using various techniques. Now you will deal with data in a different format: the API data at the next URL does not follow a typical layout, which makes it harder to determine the column names and their corresponding data.

To create a DataFrame from this unorganized data, you will perform some extra steps. First, you will extract the column names and the values belonging to every column. Then you will build a dictionary whose keys are the column names and whose values are the column elements (as lists). The result is a well-structured dictionary in the Method 2 format, which you can pass to pandas to convert into a DataFrame.

This step uses data from the NY School Demographics and Accountability Snapshot.

Tracking Column Information in the API Endpoint

In this section, you will locate the information that describes the columns of the data contained in the API endpoint.

To access the sample data, open this URL in your web browser:

https://data.cityofnewyork.us/api/views/ihfw-zy9j/rows.json

The data at this URL is noticeably less organized than the data you worked with in Step 2. The NY Snapshot data consists of various dictionaries, but they do not follow the convention in which keys represent column names and values represent column elements, and no dictionaries present the data plainly row-wise or column-wise.

To start, you need to locate the DataFrame's column information. With the API URL open in your browser, press CTRL+F and search for columns.

Columns section (API data from the NY School Snapshot URL)
...
"columns" : [ {
        "id" : -1,
        "name" : "sid",
        "dataTypeName" : "meta_data",
        "fieldName" : ":sid",
        "position" : 0,
        "renderTypeName" : "meta_data",
        "format" : { },
        "flags" : [ "hidden" ]
      }, {
        "id" : -1,
        "name" : "id",
        "dataTypeName" : "meta_data",
        "fieldName" : ":id",
        "position" : 0,
        "renderTypeName" : "meta_data",
        "format" : { },
        "flags" : [ "hidden" ]
      }, {
        "id" : -1,
        "name" : "position",
        "dataTypeName" : "meta_data",
        "fieldName" : ":position",
        "position" : 0,
        "renderTypeName" : "meta_data",
        "format" : { },
        "flags" : [ "hidden" ]
...

The columns key holds a list of dictionaries. In each dictionary, observe the value of the dataTypeName key. For the first eight dictionaries, the value of dataTypeName is meta_data. Metadata is information that describes data: here it does not pertain to the actual data in the table, but describes the table as a whole. These eight dictionaries (where the dataTypeName key has meta_data as its value) therefore do not describe the columns of the table.

However, as you move down the list, note the first occurrence of a dataTypeName key whose value is not meta_data.

Columns section, continued (API data from the NY School Snapshot URL)
...
}, {
       "id" : -1,
       "name" : "meta",
       "dataTypeName" : "meta_data",
       "fieldName" : ":meta",
       "position" : 0,
       "renderTypeName" : "meta_data",
       "format" : { },
       "flags" : [ "hidden" ]
     }, {
       "id" : 45985351,
       "name" : "DBN",
       "dataTypeName" : "text",
       "fieldName" : "dbn",
       "position" : 1,
       "renderTypeName" : "text",
       "tableColumnId" : 8032537,
       "cachedContents" : {
         "non_null" : "10075",
...

The first dictionary with an id value of 45985351 differs from the earlier dictionaries: its dataTypeName key holds text instead of meta_data. This dictionary describes the first column of your eventual dataset, and the data in this column has the type text, in other words, a string. The dictionary also includes a name key with the value DBN, which means DBN is the name of your first column. From this point on, none of the remaining dictionaries under the columns key have meta_data as the value of their dataTypeName key, so each of them describes one of the subsequent columns of your DataFrame.

Now that you know where to find the information about your table's columns, the next step is to collect their names.

Locating the Names of Columns

In this section, you will add the names of all the columns from the API endpoint to a list named columns. Add the following lines to a new cell in your Jupyter Notebook and run the cell:

convert_data.ipynb
response=requests.get('https://data.cityofnewyork.us/api/views/ihfw-zy9j/rows.json').json()
columns=[]
for i in response['meta']['view']['columns']:
    if(i['dataTypeName']=='meta_data'):
        continue
    else:
        columns.append(i['name'])

Just like in Step 2, Method 3, you will use the requests module to read the data from the URL and store it in a variable called response.

In the second line, you define an empty list named columns. In the third line, you access the columns key, which is nested within the view key, itself nested within the meta key.

The for loop goes through each dictionary in the list linked to the columns key and examines the dataTypeName values of those dictionaries. If the current dictionary’s dataTypeName value is meta_data, you move on to the next dictionary. If it’s not, you add the value of the name key to the columns list.
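The same filtering logic can be condensed into a single list comprehension, a common Python idiom for building one list from another. This is an equivalent sketch, not a required change:

columns = [col['name']
           for col in response['meta']['view']['columns']
           if col['dataTypeName'] != 'meta_data']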

Next, verify the contents of the columns list by running the following line in a new cell:

convert_data.ipynb
columns

The names of the columns will be displayed on the screen.

Output

['DBN', 'Name', 'schoolyear', 'fl_percent', 'frl_percent', 'total_enrollment', 'prek', 'k', 'grade1', 'grade2', 'grade3', 'grade4', 'grade5', 'grade6', 'grade7', 'grade8', 'grade9', 'grade10', 'grade11', 'grade12', 'ell_num', 'ell_percent', 'sped_num', 'sped_percent', 'ctt_num', 'selfcontained_num', 'asian_num', 'asian_per', 'black_num', 'black_per', 'hispanic_num', 'hispanic_per', 'white_num', 'white_per', 'male_num', 'male_per', 'female_num', 'female_per']

The result confirms that every column, except those whose dataTypeName is meta_data, has been added to the columns list.

In this section, you collected the column names from the disorganized API endpoint; you will need them to construct the final DataFrame. Next, you will extract the data that belongs to each of these columns.

Creating a List containing n Sublists

In this section, you will determine the number of columns in the API and create a list containing that many sublists.

Ultimately, using Method 2 from Step 2, you will generate the pandas DataFrame from a dictionary whose keys are the column names and whose values are lists holding each column's elements. Once you know the number of columns in the API, you will create a list of that many sublists, each holding the data for its respective column.

You will fill the dictionary dynamically by looping over the columns list and the list of sublists: the sublist at index 0 will hold the elements for the column name at index 0 of columns, the sublist at index 1 for the column name at index 1, and so on.

You already built the list of column names in the previous section. Check its length by running this line in a new cell:

convert_data.ipynb
len(columns)

The built-in len() function returns the number of items in the columns list.

The total number of columns in the API is displayed:

Output

38

As you have 38 columns, you need to create a list consisting of 38 sublists. Each sublist will hold the data for its respective column.

To create a list named d consisting of 38 sublists, run the following code in a new cell:

convert_data.ipynb
d = [[] for x in range(len(columns))]

d

You name the list d (you can pick another name). The comprehension creates one empty sublist ([]) for each of the 38 columns you found earlier.

The output displays the list of 38 sublists:

Output

[[], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], [], []]

These sublists are empty at the beginning since you have not added any elements to them yet.
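The comprehension syntax matters here. A tempting shorthand, [[]] * 38, would create 38 references to one shared list, so appending to any sublist would appear to append to all of them. A small sketch of the difference:

shared = [[]] * 3                  # three references to the same inner list
shared[0].append('x')
print(shared)                      # [['x'], ['x'], ['x']]

separate = [[] for x in range(3)]  # three independent inner lists
separate[0].append('x')
print(separate)                    # [['x'], [], []]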

38 sublists have been created within the list d. These 38 sublists will contain column-wise data for the 38 columns. The next step is to add the content to these sublists.

Getting the Values From the Columns

In this section, you will retrieve the data for each column from the API endpoint and append it to the 38 sublists you just created. To do this, you must first locate the data key within the API; the value associated with the data key holds the table's contents.

With the API URL open in your browser, press CTRL+F and search for the keyword data. You will find approximately 73 occurrences. Jump to the final occurrence, where you will see the following:

Data section (API data from the NY School Snapshot URL)
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
, [ "row-bgbf_ntuw.ckdq", "00000000-0000-0000-9AF5-4D58BAF51C20", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20062007", "89.4", null, "243", "15", "29", "39", "38", "34", "42", "46", null, null, null, null, null, null, null, "38", "15.6", "55", "22.6", "19", "15", "18", "7.4", "68", "28", "153", "63", "4", "1.6", "140", "57.6", "103", "42.4" ]
, [ "row-mspc-8wz5_uxb8", "00000000-0000-0000-9E11-73B99A1B02D9", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20072008", "89.4", null, "261", "18", "43", "39", "36", "38", "47", "40", null, null, null, null, null, null, null, "52", "19.9", "60", "23", "20", "14", "16", "6.1", "77", "29.5", "157", "60.2", "7", "2.7", "143", "54.8", "118", "45.2" ]
, [ "row-p6za~9ikt~ums7", "00000000-0000-0000-D2CD-5904BA5DC16E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20082009", "89.4", null, "252", "17", "37", "44", "32", "34", "39", "49", null, null, null, null, null, null, null, "48", "19", "62", "24.6", "21", "17", "16", "6.3", "75", "29.8", "149", "59.1", "7", "2.8", "149", "59.1", "103", "40.9" ]
, [ "row-55ss_ktcg~w7ru", "00000000-0000-0000-9425-35118FA9200F", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20092010", "   ", "96.5", "208", "16", "40", "28", "32", "30", "24", "38", null, null, null, null, null, null, null, "40", "19.2", "46", "22.1", "14", "14", "16", "7.7", "67", "32.2", "118", "56.7", "6", "2.9", "124", "59.6", "84", "40.4" ]
, [ "row-ugmn-sxmy_fyiu", "00000000-0000-0000-2070-ABC0343F1148", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20102011", "   ", "96.5", "203", "13", "37", "35", "33", "30", "30", "25", null, null, null, null, null, null, null, "30", "14.8", "46", "22.7", "21", "9", "13", "6.4", "75", "36.9", "110", "54.2", "4", "2", "113", "55.7", "90", "44.3" ]
, [ "row-a4rf-xqy7~fvkh", "00000000-0000-0000-AF7F-41FA5EB8E91E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE                       ", "20112012", null, "89.4", "189", "13", "31", "35", "28", "25", "28", "29", "    ", "    ", "    ", "    ", "    ", "    ", "    ", "20", "10.6", "40", "21.2", "23", "7", "12", "6.3", "63", "33.3", "109", "57.7", "4", "2.1", "97", "51.3", "92", "48.7" ]
...

Every sublist inside the list assigned to the data key holds the values of one row of your eventual DataFrame. To find out where the actual row data starts, compare the number of columns with the number of elements in a single sublist of the data section.

To check the number of columns again, run the following code in a new cell:

convert_data.ipynb
len(columns)

The number of columns will be displayed in the following output.

Output

38

Next, check the length of the first sublist in data, which is the number of elements in the first row:

convert_data.ipynb

len(response['data'][0])

You access the first sublist in data with index 0.

The output displays the number of elements in the first sublist of data:

Output

46

You might expect the number of elements in the first sublist (46) to match the number of columns (38), since the data is organized row-wise. However, each sublist has a length of 46, which is 8 more than 38. Recall that you found 8 dictionaries under the columns key whose dataTypeName value was meta_data. The extra 8 values at the start of each sublist are therefore metadata values and do not reflect the table's contents.

Because list indexes start at 0, the elements at indexes 0 through 7 (inclusive) of each sublist hold metadata values rather than DataFrame contents. The row data begins at index 8.
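Rather than hardcoding index 8, you can derive it from the lengths you just measured, which keeps the code correct even if the number of metadata fields were to change. A short sketch that introduces a hypothetical offset variable, reused in later examples:

# number of leading metadata values in each row sublist: 46 - 38 == 8
offset = len(response['data'][0]) - len(columns)
print(offset)   # 8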

To append elements from the data sublists, you will loop from index 8 through index 45. For testing purposes, extract the data from the first row only rather than from all the sublists. Run these lines in a new cell:

convert_data.ipynb
count=0

for i in response['data'][0][len(response['data'][0])-len(columns):len(response['data'][0])]: #from index 8 to 45

    d[count].append(i)

    count=count+1

d

Here, the count variable tracks which sublist of d receives each value from the first row.

Next, the code goes through the initial sublist in “data” and retrieves the elements within the range of index 8 to index 45, resulting in a total of 38 elements.

len(response['data'][0]) is the number of elements in the first sublist under the data key, while len(columns) is the number of columns in the final DataFrame, so the slice runs from their difference (index 8) to the end of the sublist.

Here is a summary of the first few iterations. count starts at 0. Upon entering the for loop, i is response['data'][0][8], the first non-metadata value in the first sublist. You append this value of i to d[0], because count is currently 0, placing it in the first sublist of d.

count then increases to 1. Next, i is response['data'][0][9], and you append it to d[1], the second sublist of d, because count was incremented in the previous iteration. This continues until index 45 is reached. Finally, the last line displays the updated list d with its sublists.

The information from the initial row is presented in the output.

Output

[['01M015'], ['P.S. 015 ROBERTO CLEMENTE'], ['20052006'], ['89.4'], [None], ['281'], ['15'], ['36'], ['40'], ['33'], ['38'], ['52'], ['29'], ['38'], [None], [None], [None], [None], [None], [None], ['36'], ['12.8'], ['57'], ['20.3'], ['25'], ['9'], ['10'], ['3.6'], ['74'], ['26.3'], ['189'], ['67.3'], ['5'], ['1.8'], ['158'], ['56.2'], ['123'], ['43.8']]

To verify that the first row holds accurate information, cross-reference it with the first sublist in the list that serves as the value of the data key at the API URL:

Contents of the first sublist (API data from the NY School Snapshot URL)
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
...

To see the entire data snippet, you might have to scroll towards the right of the box mentioned above.

The values at indexes 0 through 7 pertain to the metadata columns. The highlighted values in the first sublist (indexes 8 through 45) correspond to the values of the first row, which you have successfully extracted.
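For reference, the same single-row extraction can be written with slicing and enumerate(), which removes the manual count bookkeeping. A sketch using the hypothetical offset variable from earlier (it also resets d, so values are not appended twice):

offset = len(response['data'][0]) - len(columns)

d = [[] for x in range(len(columns))]                        # start from empty sublists
for count, value in enumerate(response['data'][0][offset:]):
    d[count].append(value)                                   # value at offset+count goes to sublist count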

To perform the same operation for every row, you must modify the previous code. First, redefine the d list of 38 sublists by running the following command in a new cell:

convert_data.ipynb
d = [[] for x in range(len(columns))]

You redefine d because you are about to append the elements from all the rows, including the first one; resetting the sublists prevents the first row's values from being added twice.

With a fresh set of 38 sublists in place, run the following code in a new cell to extract the data from every row of the eventual DataFrame:

convert_data.ipynb
for i in response['data']:
    count=0
    for j in range(len(response['data'][0])-len(columns),len(response['data'][0])):
        d[count].append(i[j])
        count+=1

In the first line, i iterates over response['data'], so i is assigned each row's sublist in turn.

Here is a summary of a few iterations. Initially, i is the first sublist in the list assigned to the data key, and count is 0.

Line 3 iterates j over the indexes from 8 up to 46. Because the upper bound of range() is exclusive in Python, j actually takes the values 8 through 45.

d[count].append(i[j]) translates to d[0].append(i[8]): count is currently 0 and j is 8, so the element at index 8 of the sublist i is appended to the first sublist of d. count then increases by 1, and the inner loop continues.

On the next pass, d[count].append(i[j]) becomes d[1].append(i[9]) (i is still the first sublist in the outer loop): the element at index 9 of the first row is appended to the second sublist of d, and count increases again. The inner loop continues until all the values of a single row have been distributed across the sublists of d.

The final append of the inner loop for the first row is d[37].append(i[45]). At the end of the first iteration of i, the list d matches the earlier single-row result, with exactly one element in each sublist.

When the inner loop ends, the outer loop assigns i the next sublist from the list under the data key, and count is reset to 0.

When the inner loop finishes during the second iteration of i, every sublist in d holds two values, and the second value in each sublist comes from the second row. This pattern continues for all the sublists in the list assigned to the data key.
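The nested loop spells out every step, which is useful for learning, but the same row-to-column transpose can be expressed with zip(). A compact sketch, assuming response and columns are defined as above:

offset = len(response['data'][0]) - len(columns)
rows = [row[offset:] for row in response['data']]   # strip the metadata values from every row
d = [list(col) for col in zip(*rows)]               # transpose the rows into 38 column lists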

To see what the collected data looks like inside the sublists of d, you will examine the first five elements of the first three sublists. In the resulting DataFrame, these values will be the first five elements of the first three columns.

Start by examining the first five elements of the sublist at index 0 of d:

convert_data.ipynb
d[0][0:5]

The output will display the following:

Output

['01M015', '01M015', '01M015', '01M015', '01M015']

To confirm the output, you can compare it with the data section of the API as displayed in your web browser.

Contents of the first sublist (API data from the NY School Snapshot URL)
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
, [ "row-bgbf_ntuw.ckdq", "00000000-0000-0000-9AF5-4D58BAF51C20", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20062007", "89.4", null, "243", "15", "29", "39", "38", "34", "42", "46", null, null, null, null, null, null, null, "38", "15.6", "55", "22.6", "19", "15", "18", "7.4", "68", "28", "153", "63", "4", "1.6", "140", "57.6", "103", "42.4" ]
, [ "row-mspc-8wz5_uxb8", "00000000-0000-0000-9E11-73B99A1B02D9", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20072008", "89.4", null, "261", "18", "43", "39", "36", "38", "47", "40", null, null, null, null, null, null, null, "52", "19.9", "60", "23", "20", "14", "16", "6.1", "77", "29.5", "157", "60.2", "7", "2.7", "143", "54.8", "118", "45.2" ]
, [ "row-p6za~9ikt~ums7", "00000000-0000-0000-D2CD-5904BA5DC16E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20082009", "89.4", null, "252", "17", "37", "44", "32", "34", "39", "49", null, null, null, null, null, null, null, "48", "19", "62", "24.6", "21", "17", "16", "6.3", "75", "29.8", "149", "59.1", "7", "2.8", "149", "59.1", "103", "40.9" ]
, [ "row-55ss_ktcg~w7ru", "00000000-0000-0000-9425-35118FA9200F", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20092010", "   ", "96.5", "208", "16", "40", "28", "32", "30", "24", "38", null, null, null, null, null, null, null, "40", "19.2", "46", "22.1", "14", "14", "16", "7.7", "67", "32.2", "118", "56.7", "6", "2.9", "124", "59.6", "84", "40.4" ]
,
...

To see the entire data snippet, it may be necessary for you to scroll towards the right side of the box above.

The values you obtained correspond to the highlighted values at index 8 of each sublist. The elements at indexes 0 through 7 (inclusive) of these sublists hold metadata values rather than DataFrame values. These five values will ultimately be the first five elements of the column at index 0 of the columns list (DBN).

Next, examine the first five elements of the sublist at index 1 of d:

convert_data.ipynb
d[1][0:5]

Only the first five elements of the second sublist will be displayed on the screen.

Output

['P.S. 015 ROBERTO CLEMENTE', 'P.S. 015 ROBERTO CLEMENTE', 'P.S. 015 ROBERTO CLEMENTE', 'P.S. 015 ROBERTO CLEMENTE', 'P.S. 015 ROBERTO CLEMENTE']

These values correspond to the first five elements of the Name column, which sits at index 1 of the columns list. Validate the output against the data section of the API URL:

Contents of the sublist at index 1 (API data from the NY School Snapshot URL)
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
, [ "row-bgbf_ntuw.ckdq", "00000000-0000-0000-9AF5-4D58BAF51C20", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20062007", "89.4", null, "243", "15", "29", "39", "38", "34", "42", "46", null, null, null, null, null, null, null, "38", "15.6", "55", "22.6", "19", "15", "18", "7.4", "68", "28", "153", "63", "4", "1.6", "140", "57.6", "103", "42.4" ]
, [ "row-mspc-8wz5_uxb8", "00000000-0000-0000-9E11-73B99A1B02D9", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20072008", "89.4", null, "261", "18", "43", "39", "36", "38", "47", "40", null, null, null, null, null, null, null, "52", "19.9", "60", "23", "20", "14", "16", "6.1", "77", "29.5", "157", "60.2", "7", "2.7", "143", "54.8", "118", "45.2" ]
, [ "row-p6za~9ikt~ums7", "00000000-0000-0000-D2CD-5904BA5DC16E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20082009", "89.4", null, "252", "17", "37", "44", "32", "34", "39", "49", null, null, null, null, null, null, null, "48", "19", "62", "24.6", "21", "17", "16", "6.3", "75", "29.8", "149", "59.1", "7", "2.8", "149", "59.1", "103", "40.9" ]
, [ "row-55ss_ktcg~w7ru", "00000000-0000-0000-9425-35118FA9200F", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20092010", "   ", "96.5", "208", "16", "40", "28", "32", "30", "24", "38", null, null, null, null, null, null, null, "40", "19.2", "46", "22.1", "14", "14", "16", "7.7", "67", "32.2", "118", "56.7", "6", "2.9", "124", "59.6", "84", "40.4" ]
,
...

To view the complete data snippet, you might have to scroll right within the box. The values you obtained correspond to the highlighted ones.

Finally, examine the first five elements of the sublist at index 2 of d:

convert_data.ipynb
d[2][0:5]

The following result will be displayed on the screen.

Output

['20052006', '20062007', '20072008', '20082009', '20092010']

Compare the output with the data section at the API URL to verify it:

Contents of the sublist at index 2 (API data from the NY School Snapshot URL)
...
"data" : [ [ "row-h6zp~rx75.iwiv", "00000000-0000-0000-C205-81EF0C7F0969", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20052006", "89.4", null, "281", "15", "36", "40", "33", "38", "52", "29", "38", null, null, null, null, null, null, "36", "12.8", "57", "20.3", "25", "9", "10", "3.6", "74", "26.3", "189", "67.3", "5", "1.8", "158", "56.2", "123", "43.8" ]
, [ "row-bgbf_ntuw.ckdq", "00000000-0000-0000-9AF5-4D58BAF51C20", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20062007", "89.4", null, "243", "15", "29", "39", "38", "34", "42", "46", null, null, null, null, null, null, null, "38", "15.6", "55", "22.6", "19", "15", "18", "7.4", "68", "28", "153", "63", "4", "1.6", "140", "57.6", "103", "42.4" ]
, [ "row-mspc-8wz5_uxb8", "00000000-0000-0000-9E11-73B99A1B02D9", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20072008", "89.4", null, "261", "18", "43", "39", "36", "38", "47", "40", null, null, null, null, null, null, null, "52", "19.9", "60", "23", "20", "14", "16", "6.1", "77", "29.5", "157", "60.2", "7", "2.7", "143", "54.8", "118", "45.2" ]
, [ "row-p6za~9ikt~ums7", "00000000-0000-0000-D2CD-5904BA5DC16E", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20082009", "89.4", null, "252", "17", "37", "44", "32", "34", "39", "49", null, null, null, null, null, null, null, "48", "19", "62", "24.6", "21", "17", "16", "6.3", "75", "29.8", "149", "59.1", "7", "2.8", "149", "59.1", "103", "40.9" ]
, [ "row-55ss_ktcg~w7ru", "00000000-0000-0000-9425-35118FA9200F", 0, 1425758466, null, 1425758466, null, "{ }", "01M015", "P.S. 015 ROBERTO CLEMENTE", "20092010", "   ", "96.5", "208", "16", "40", "28", "32", "30", "24", "38", null, null, null, null, null, null, null, "40", "19.2", "46", "22.1", "14", "14", "16", "7.7", "67", "32.2", "118", "56.7", "6", "2.9", "124", "59.6", "84", "40.4" ]
,
...

You might have to scroll horizontally within the box above to see the complete data snippet. The values you obtained match the highlighted values: they are the first five elements of the schoolyear column, which sits at index 2 of the columns list.

With this step complete, you have retrieved the data from every row and column and appended it to the individual sublists of d. You are now ready for the final stage: generating a pandas DataFrame from the acquired data.

Creating a Dictionary for the DataFrame

By this point, you have gathered the names and values of all 38 columns. Now you will feed them to a pandas DataFrame via a dictionary: you will build a dictionary whose keys are the column names (taken from the columns list) and whose values are the 38 sublists from d.

Execute the provided code in a fresh cell.

convert_data.ipynb
json_dict={}

for i in range(0,len(columns)):
	json_dict.update({columns[i]:d[i]})

Looping over the range from 0 to 38 (38 excluded), you add key-value pairs to the resulting dictionary, json_dict, which you will later pass to pandas for conversion into a DataFrame. For example, when i is 0, you assign the sublist of d at index 0 as the value for the key DBN, which is columns[0]. The process continues for the remaining indexes.

To get a sense of what the dictionary looks like, examine all the keys in json_dict:

convert_data.ipynb
json_dict.keys()

The final dictionary’s list of keys will be displayed on the screen.

Output

dict_keys(['DBN', 'Name', 'schoolyear', 'fl_percent', 'frl_percent', 'total_enrollment', 'prek', 'k', 'grade1', 'grade2', 'grade3', 'grade4', 'grade5', 'grade6', 'grade7', 'grade8', 'grade9', 'grade10', 'grade11', 'grade12', 'ell_num', 'ell_percent', 'sped_num', 'sped_percent', 'ctt_num', 'selfcontained_num', 'asian_num', 'asian_per', 'black_num', 'black_per', 'hispanic_num', 'hispanic_per', 'white_num', 'white_per', 'male_num', 'male_per', 'female_num', 'female_per'])

The 38 keys in json_dict correspond to the 38 sublists of d. This results in a dictionary where the column names act as keys, and each key is associated with a list of elements from its respective column.
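Because columns and d line up index for index, the same dictionary can also be built in one line with zip(). An equivalent sketch:

json_dict = dict(zip(columns, d))   # pairs columns[i] with d[i] for every index i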

You can now hand the dictionary to pandas to create the DataFrame. You can assign any variable name you prefer to this DataFrame (this tutorial uses data):

convert_data.ipynb
data=pd.DataFrame(json_dict)

Then examine a portion of the resulting DataFrame:

convert_data.ipynb
data.head()

You will see the first five rows of the DataFrame displayed on the screen.

Output

      DBN                       Name schoolyear fl_percent frl_percent total_enrollment prek   k grade1 grade2 ... black_num black_per hispanic_num hispanic_per white_num white_per male_num male_per female_num female_per
0  01M015  P.S. 015 ROBERTO CLEMENTE   20052006       89.4        None              281   15  36     40     33 ...        74      26.3          189         67.3         5       1.8      158     56.2        123       43.8
1  01M015  P.S. 015 ROBERTO CLEMENTE   20062007       89.4        None              243   15  29     39     38 ...        68        28          153           63         4       1.6      140     57.6        103       42.4
2  01M015  P.S. 015 ROBERTO CLEMENTE   20072008       89.4        None              261   18  43     39     36 ...        77      29.5          157         60.2         7       2.7      143     54.8        118       45.2
3  01M015  P.S. 015 ROBERTO CLEMENTE   20082009       89.4        None              252   17  37     44     32 ...        75      29.8          149         59.1         7       2.8      149     59.1        103       40.9
4  01M015  P.S. 015 ROBERTO CLEMENTE   20092010                   96.5              208   16  40     28     32 ...        67      32.2          118         56.7         6       2.9      124     59.6         84       40.4

In order to see the complete first section of the DataFrame, you might have to scroll towards the right within the box above. The DataFrame has been successfully formed by inputting the data obtained from the URL.
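Having worked through the column-wise route, note that pandas can also consume the row-wise data directly: pass the trimmed rows together with the column names and skip the transpose entirely. A condensed sketch of the whole pipeline, under the same assumptions as the earlier examples:

offset = len(response['data'][0]) - len(columns)
data_alt = pd.DataFrame(
    [row[offset:] for row in response['data']],   # each trimmed row becomes one DataFrame row
    columns=columns,
)
data_alt.head()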

Conclusion

In this tutorial, you learned different ways to create a pandas DataFrame using dictionaries with different formats and a URL. Additionally, you created a DataFrame from an API endpoint where the columns and their values were not presented in a direct manner.

This guide should help you understand how to use different techniques to create DataFrames from non-traditional API endpoints and dictionaries. To learn more about pandas DataFrames, explore our collection of pandas tutorials.

 

More tutorials

A tutorial on the Python Pandas module

How can I include new entries in a Python dictionary?

How to include items to a list in Python

Interview Questions for Web Services – SOAP, RESTful