{"database": "pelican", "table": "content", "rows": [["ryan", "musings", "In Part III I'm reviewing the code to populate a DataFrame with Passer data\nfrom the current NFL season.\n\nTo start I use the `games` `DataFrame` I created in [Part\nII](https://www.ryancheley.com/blog/2016/11/18/web-scrapping-passer-data-part-ii)\nto create 4 new `DataFrames`:\n\n  * reg_season_games - All of the Regular Season Games\n  * pre_season_games - All of the Pre Season Games\n  * gameshome - The Home Games\n  * gamesaway - The Away Games\n\nA cool aspect of the DataFrames is that you can treat them kind of like\ntemporary tables (at least, this is how I'm thinking about them as I am mostly\na `SQL` programmer) and create other temporary tables based on criteria. In\nthe code below I'm taking the `nfl_start_date` that I defined in [Part\nII](https://www.ryancheley.com/blog/2016/11/18/web-scrapping-passer-data-part-ii)\nas a way to split the data frame into pre-season and regular-season\n`DataFrames`. I then take the regular season `DataFrame` and split that into\nhome and away `DataFrames`. 
I do this so I don't double count the statistics\nfor the passers.\n\n    \n    \n    #Start Section 3\n    \n    reg_season_games = games.loc[games['match_date'] >= nfl_start_date]\n    pre_season_games = games.loc[games['match_date'] < nfl_start_date]\n    \n    gameshome = reg_season_games.loc[reg_season_games['ha_ind'] == 'vs']\n    gamesaway = reg_season_games.loc[reg_season_games['ha_ind'] == '@']\n    \n\nNext, I set up some variables to be used later:\n\n    \n    \n    BASE_URL = 'http://www.espn.com/nfl/boxscore/_/gameId/{0}'\n    \n    #Create the lists to hold the values for the games for the passers\n    player_pass_name = []\n    player_pass_catch = []\n    player_pass_attempt = []\n    player_pass_yds = []\n    player_pass_avg = []\n    player_pass_td = []\n    player_pass_int = []\n    player_pass_sacks = []\n    player_pass_sacks_yds_lost = []\n    player_pass_rtg = []\n    player_pass_week_id = []\n    player_pass_result = []\n    player_pass_team = []\n    player_pass_ha_ind = []\n    player_match_id = []\n    player_id = [] #declare the player_id as a list so it doesn't get set to a str by the loop below\n    \n    headers_pass = ['match_id', 'id', 'Name', 'CATCHES','ATTEMPTS', 'YDS', 'AVG', 'TD', 'INT', 'SACKS', 'YRDLSTSACKS', 'RTG']\n    \n\nNow it's time to start populating some of the `list` variables I created\nabove. I am taking the `week_id`, `result`, `team_x`, and `ha_ind` columns\nfrom the `gamesaway` `DataFrame` (I'm sure there is a better way to do this, and I\nwill need to revisit it in the future):\n\n    \n    \n    player_pass_week_id.append(gamesaway.week_id)\n    player_pass_result.append(gamesaway.result)\n    player_pass_team.append(gamesaway.team_x)\n    player_pass_ha_ind.append(gamesaway.ha_ind)\n    \n\nNow for the looping (everybody's favorite part!). Using `BeautifulSoup` I get\nthe `div` of class `col column-one gamepackage-away-wrap`. 
Once I have that I\nget the table rows and then loop through the data in the row to get what I\nneed from the table holding the passer data. Some interesting things happening\nbelow:\n\n  * The Catches / Attempts and Sacks / Yrds Lost are displayed as a single column each (even though each column holds 2 statistics). In order to _fix_ this I use the `index()` method to find the separator character (`/` and `-` respectively for the two columns), split the text into the part to the left and the part to the right of it, and append the resulting 2 items per column to 2 different lists (so four lists in total).\n\nThe last line of code gets the [ESPN](https://www.espn.com) `player_id`, just\nin case I need/want to use it later.\n\n    \n    \n    for index, row in gamesaway.iterrows():\n        print(index)\n        try:\n            request = requests.get(BASE_URL.format(index))\n            table_pass = BeautifulSoup(request.text, 'lxml').find_all('div', class_='col column-one gamepackage-away-wrap')\n    \n            pass_ = table_pass[0]\n            player_pass_all = pass_.find_all('tr')\n    \n    \n            for tr in player_pass_all:\n                for td in tr.find_all('td', class_='sacks'):\n                    for t in tr.find_all('td', class_='name'):\n                        if t.text != 'TEAM':\n                            player_pass_sacks.append(int(td.text[0:td.text.index('-')]))\n                            player_pass_sacks_yds_lost.append(int(td.text[td.text.index('-')+1:]))\n                for td in tr.find_all('td', class_='c-att'):\n                    for t in tr.find_all('td', class_='name'):\n                        if t.text != 'TEAM':\n                            player_pass_catch.append(int(td.text[0:td.text.index('/')]))\n                            player_pass_attempt.append(int(td.text[td.text.index('/')+1:]))\n                for td in tr.find_all('td', class_='name'):\n                    for t in tr.find_all('td', class_='name'):\n                        for s in t.find_all('span', class_=''):\n                            if t.text != 'TEAM':\n                                player_pass_name.append(s.text)\n                for td in tr.find_all('td', class_='yds'):\n                    for t in tr.find_all('td', class_='name'):\n                        if t.text != 'TEAM':\n                            player_pass_yds.append(int(td.text))\n                for td in tr.find_all('td', class_='avg'):\n                    for t in tr.find_all('td', class_='name'):\n                        if t.text != 'TEAM':\n                            player_pass_avg.append(float(td.text))\n                for td in tr.find_all('td', class_='td'):\n                    for t in tr.find_all('td', class_='name'):\n                        if t.text != 'TEAM':\n                            player_pass_td.append(int(td.text))\n                for td in tr.find_all('td', class_='int'):\n                    for t in tr.find_all('td', class_='name'):\n                        if t.text != 'TEAM':\n                            player_pass_int.append(int(td.text))\n                for td in tr.find_all('td', class_='rtg'):\n                    for t in tr.find_all('td', class_='name'):\n                        if t.text != 'TEAM':\n                            player_pass_rtg.append(float(td.text))\n                            player_match_id.append(index)\n                #The code below cycles through the passers and gets their ESPN Player ID\n                for a in tr.find_all('a', href=True):\n                    player_id.append(a['href'].replace(\"http://www.espn.com/nfl/player/_/id/\",\"\")[0:a['href'].replace(\"http://www.espn.com/nfl/player/_/id/\",\"\").index('/')])\n    \n        except Exception as e:\n            pass\n    \n\nWith all of the data from above we now populate our `DataFrame` using specific\nheaders (that's why we set the `headers_pass` variable above):\n\n    \n    \n    player_passer_data = 
pd.DataFrame(np.column_stack((\n    player_match_id,\n    player_id,\n    player_pass_name,\n    player_pass_catch,\n    player_pass_attempt,\n    player_pass_yds,\n    player_pass_avg,\n    player_pass_td,\n    player_pass_int,\n    player_pass_sacks,\n    player_pass_sacks_yds_lost,\n    player_pass_rtg\n    )), columns=headers_pass)\n    \n\nAn issue that I ran into as I was playing with the generated `DataFrame` was\nthat even though I had set the numbers generated in the `for` loop above to be\nof type `int`, anytime I would do something like a `sum()` on the `DataFrame`\nthe numbers would be concatenated as though they were `strings` (because they\nwere! `np.column_stack` builds a single array with one data type, so mixing\nplayer names with numbers turns every value into a string).\n\nAfter much [Googling](https://www.google.com) I came across a [useful\nanswer](http://stackoverflow.com/questions/15891038/pandas-change-data-type-of-columns)\non [StackExchange](https://www.stackexchange.com) (where else\nwould I find it, right?)\n\nWhat it does is convert the columns from `string` to numeric types:\n\n    \n    \n    player_passer_data[['TD', 'CATCHES', 'ATTEMPTS', 'YDS', 'INT', 'SACKS', 'YRDLSTSACKS','AVG','RTG']] = player_passer_data[['TD', 'CATCHES', 'ATTEMPTS', 'YDS', 'INT', 'SACKS', 'YRDLSTSACKS','AVG','RTG']].apply(pd.to_numeric)\n    \n\nOK, so I've got a `DataFrame` with passer data, I've got a `DataFrame` with\naway game data, now I need to join them. As expected, `pandas` has a way to\njoin `DataFrame` data ... with the [join](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html)\nmethod obviously!\n\nI create a new `DataFrame` called `game_passer_data` which joins\n`player_passer_data` with `gamesaway` on their common key `match_id`. I then\nhave to use `set_index` to make sure that the index stays set to `match_id`\n... 
If I don't then the `index` is reset to an auto-incremented integer.\n\n    \n    \n    game_passer_data = player_passer_data.join(gamesaway, on='match_id').set_index('match_id')\n    \n\nThis is great, but now `game_passer_data` has all of these extra columns.\nBelow is the result of running `game_passer_data.head()` from the terminal:\n\n    \n    \n                    id          Name  CATCHES  ATTEMPTS  YDS  AVG  TD  INT  SACKS\n    match_id\n    400874518  2577417  Dak Prescott       22        30  292  9.7   0    0      4\n    400874674  2577417  Dak Prescott       23        32  245  7.7   2    0      2\n    400874733  2577417  Dak Prescott       18        27  247  9.1   3    1      2\n    400874599  2577417  Dak Prescott       21        27  247  9.1   3    0      0\n    400874599    12482  Mark Sanchez        1         1    8  8.0   0    0      0\n    \n               YRDLSTSACKS  ...\n    match_id                ...\n    400874518           14  ...\n    400874674           11  ...\n    400874733           14  ...\n    400874599            0  ...\n    400874599            0  ...\n    \n               ha_ind  match_date                  opp result          team_x\n    match_id\n    400874518       @  2016-09-18  washington-redskins      W  Dallas Cowboys\n    400874674       @  2016-10-02  san-francisco-49ers      W  Dallas Cowboys\n    400874733       @  2016-10-16    green-bay-packers      W  Dallas Cowboys\n    400874599       @  2016-11-06     cleveland-browns      W  Dallas Cowboys\n    400874599       @  2016-11-06     cleveland-browns      W  Dallas Cowboys\n    \n              week_id prefix_1             prefix_2               team_y\n    match_id\n    400874518       2      wsh  washington-redskins  Washington Redskins\n    400874674       4       sf  san-francisco-49ers  San Francisco 49ers\n    400874733       6       gb    green-bay-packers    Green Bay Packers\n    400874599       9      cle     cleveland-browns     Cleveland Browns\n    400874599       9      cle     cleveland-browns     Cleveland Browns\n    \n                                                             url\n    match_id\n    400874518  http://www.espn.com/nfl/team/_/name/wsh/washin...\n    400874674  http://www.espn.com/nfl/team/_/name/sf/san-fra...\n    400874733  http://www.espn.com/nfl/team/_/name/gb/green-b...\n    400874599  http://www.espn.com/nfl/team/_/name/cle/clevel...\n    400874599  http://www.espn.com/nfl/team/_/name/cle/clevel...\n    \n\nThat is nice, but not exactly what I want. In order to remove the _extra_\ncolumns I use the `drop` method which takes 2 arguments:\n\n  * what object to drop\n  * an axis which determines what type of object to drop (0 = rows, 1 = columns)\n\nBelow, the object I define is a list of columns (figured that part all out on\nmy own as the documentation didn't explicitly state I could use a list, but I\nfigured, what's the worst that could happen?)\n\n    \n    \n    game_passer_data = game_passer_data.drop(['opp', 'prefix_1', 'prefix_2', 'url'], axis=1)\n    \n\nWhich gives me this:\n\n    \n    \n                    id          Name  CATCHES  ATTEMPTS  YDS  AVG  TD  INT  SACKS\n    match_id\n    400874518  2577417  Dak Prescott       22        30  292  9.7   0    0      4\n    400874674  2577417  Dak Prescott       23        32  245  7.7   2    0      2\n    400874733  2577417  Dak Prescott       18        27  247  9.1   3    1      2\n    400874599  2577417  Dak Prescott       21        27  247  9.1   3    0      0\n    400874599    12482  Mark Sanchez        1         1    8  8.0   0    0      0\n    \n               YRDLSTSACKS    RTG ha_ind  match_date result          team_x\n    match_id\n    400874518           14  103.8      @  2016-09-18      W  Dallas Cowboys\n    400874674           11  114.7      @  2016-10-02      W  Dallas Cowboys\n    400874733           14  117.4      @  2016-10-16      W  Dallas Cowboys\n    400874599            0  141.8      @  2016-11-06      W  Dallas Cowboys\n    400874599            0  100.0      @  2016-11-06      W  Dallas Cowboys\n    \n              week_id               team_y\n    match_id\n    400874518       2  Washington Redskins\n    400874674       4  San Francisco 49ers\n    400874733       6    Green Bay Packers\n    400874599       9     Cleveland Browns\n    400874599       9     Cleveland Browns\n    \n\nI finally have a `DataFrame` with the data I care about, **BUT** all of the\ncolumn names are wonky!\n\nThis is easy enough to fix (and should have probably been fixed earlier with\nsome of the objects I created only containing the necessary columns, but I can\nfix that later). 
I simply rename the columns as below:\n\n    \n    \n    game_passer_data.columns = ['id', 'Name', 'Catches', 'Attempts', 'YDS', 'Avg', 'TD', 'INT', 'Sacks', 'Yards_Lost_Sacks', 'Rating', 'HA_Ind', 'game_date', 'Result', 'Team', 'Week', 'Opponent']\n    \n\nI now get the data I want, with column names to match!\n\n    \n    \n                    id          Name  Catches  Attempts  YDS  Avg  TD  INT  Sacks\n    match_id\n    400874518  2577417  Dak Prescott       22        30  292  9.7   0    0      4\n    400874674  2577417  Dak Prescott       23        32  245  7.7   2    0      2\n    400874733  2577417  Dak Prescott       18        27  247  9.1   3    1      2\n    400874599  2577417  Dak Prescott       21        27  247  9.1   3    0      0\n    400874599    12482  Mark Sanchez        1         1    8  8.0   0    0      0\n    \n               Yards_Lost_Sacks  Rating HA_Ind   game_date Result            Team\n    match_id\n    400874518                14   103.8      @  2016-09-18      W  Dallas Cowboys\n    400874674                11   114.7      @  2016-10-02      W  Dallas Cowboys\n    400874733                14   117.4      @  2016-10-16      W  Dallas Cowboys\n    400874599                 0   141.8      @  2016-11-06      W  Dallas Cowboys\n    400874599                 0   100.0      @  2016-11-06      W  Dallas Cowboys\n    \n              Week             Opponent\n    match_id\n    400874518    2  Washington Redskins\n    400874674    4  San Francisco 49ers\n    400874733    6    Green Bay Packers\n    400874599    9     Cleveland Browns\n    400874599    9     Cleveland Browns\n    \n\nI've posted the code for all three parts to my [GitHub\nRepo](https://www.github.com/miloardot).\n\nWork that I still need to do:\n\n  1. Add code to get the home game data\n  2. Add code to get data for the other position players\n  3. Add code to get data for the defense\n\nWhen I started this project on Wednesday I had only a bit of exposure to very\nbasic aspects of `Python`, plus my background as a developer. I'm still a long\nway from considering myself proficient in `Python` but I know more now than I\ndid 3 days ago and for that I'm pretty excited! 
It's also given me an\n~~excuse~~ reason to write some stuff which is a nice side effect.\n\n", "2016-11-19", "web-scrapping-passer-data-part-iii", "In Part III I'm reviewing the code to populate a DataFrame with Passer data\nfrom the current NFL season.\n\nTo start I use the `games` `DataFrame` I created in [Part\nII](https://www.ryancheley.com/blog/2016/11/18/web-scrapping-passer-data-part-ii)\nto create 4 new `DataFrames`:\n\n  * reg_season_games - All of the Regular Season Games\n  * pre_season_games - All of the Pre Season Games \u2026\n\n", "Web Scrapping - Passer Data (Part III)", "https://www.ryancheley.com/2016/11/19/web-scrapping-passer-data-part-iii/"]], "columns": ["author", "category", "content", "published_date", "slug", "summary", "title", "url"], "primary_keys": ["slug"], "primary_key_values": ["web-scrapping-passer-data-part-iii"], "units": {}, "query_ms": 6.470291875302792}