{"database": "pelican", "table": "content", "is_view": false, "human_description_en": "where published_date = \"2016-11-21\"", "rows": [["ryan", "technology", "I'm an avid [Twitter](https://www.twitter.com) user, mostly as a replacement\n[RSS](https://en.wikipedia.org/wiki/RSS) feeder, but also because I can't\nstand [Facebook](https://www.facebook.com) and this allows me to learn about\nreally important world events when I need to and to just stay isolated with\n[my head in the\nsand](http://gerdleonhard.typepad.com/.a/6a00d8341c59be53ef013488b614d8970c-800wi)\nwhen I don't. It's perfect for me.\n\nOne of the people I follow on [Twitter](https://twitter.com/drdrang) is [Dr.\nDrang](http://www.leancrew.com/all-this/), who is an engineer of some kind by\ntraining. He also appears to be a fan of baseball and posted an [analysis of\nJake Arrieta's pitching](http://leancrew.com/all-this/2016/09/jake-arrieta-and-python/)\nover the course of the 2016 MLB season (through September 22 at\nleast).\n\nWhen I first read it I hadn't done too much with Python, and while I found the\nresults interesting, I wasn't sure what any of the code was doing (not really\nanyway).\n\nSince I had just spent the last couple of days learning more about\n`BeautifulSoup` specifically and `Python` in general, I thought I'd try to do\ntwo things:\n\n  1. Update the data used by Dr. Drang\n  2. Try to generalize it for any pitcher\n\nDr. 
Drang uses a flat CSV file for his analysis and I wanted to use\n`BeautifulSoup` to scrape the data from [ESPN](https://www.espn.com) directly.\n\nOK, I know how to do that (sort of \u00af\\ _(\u30c4)_ /\u00af)\n\nFirst things first, import your libraries:\n\n    \n    \n    import pandas as pd\n    from functools import partial\n    import requests\n    import re\n    from bs4 import BeautifulSoup\n    import matplotlib.pyplot as plt\n    from datetime import datetime, date\n    from time import strptime\n    \n\nThe next two lines I ~~stole~~ borrowed directly from Dr. Drang's post. The\nfirst line is to force the plot output to be inline with the code entered in\nthe terminal. The second he explains as such:\n\n> The odd ones are the `rcParams` call, which makes the inline graphs bigger\n> than the tiny Jupyter default, and the functools import, which will help us\n> create ERAs over small portions of the season.\n\nI'm not using [Jupyter](http://jupyter.org); I'm using\n[Rodeo](http://rodeo.yhat.com) as my IDE, but I kept them all the same:\n\n    \n    \n    %matplotlib inline\n    plt.rcParams['figure.figsize'] = (12,9)\n    \n\nIn the next section I use `BeautifulSoup` to scrape the data I want from\n[ESPN](https://www.espn.com):\n\n    \n    \n    url = 'http://www.espn.com/mlb/player/gamelog/_/id/30145/jake-arrieta'\n    r = requests.get(url)\n    year = 2016\n    \n    date_pitched = []\n    full_ip = []\n    part_ip = []\n    earned_runs = []\n    \n    tables = BeautifulSoup(r.text, 'lxml').find_all('table', class_='tablehead mod-player-stats')\n    for table in tables:\n        for row in table.find_all('tr'): # Header and totals rows are skipped by the regex check below\n            columns = row.find_all('td')\n            try:\n                if re.match('[a-zA-Z]{3}\\s', columns[0].text) is not None:\n                    date_pitched.append(\n                        date(\n                        year\n                        , strptime(columns[0].text.split(' ')[0], '%b').tm_mon\n            
            , int(columns[0].text.split(' ')[1])\n                        )\n                    )\n                    full_ip.append(str(columns[3].text).split('.')[0])\n                    part_ip.append(str(columns[3].text).split('.')[1])\n                    earned_runs.append(columns[6].text)\n            except Exception as e:\n                pass\n    \n\nThis is basically a rehash of what I did for my passer scraping\n([here](https://www.ryancheley.com/blog/2016/11/17/web-scrapping),\n[here](https://www.ryancheley.com/blog/2016/11/18/web-scrapping-passer-data-part-ii),\nand [here](https://www.ryancheley.com/blog/2016/11/19/web-scrapping-passer-data-part-iii)).\n\nThis proved a useful starting point, but unlike the NFL data on ESPN, which has\npre- and regular season breaks, the MLB data on ESPN has monthly breaks, like\nthis:\n\n    \n    \n    Regular Season Games through October 2, 2016\n    DATE\n    Oct 1\n    Monthly Totals\n    DATE\n    Sep 24\n    Sep 19\n    Sep 14\n    Sep 9\n    Monthly Totals\n    DATE\n    Jun 26\n    Jun 20\n    Jun 15\n    Jun 10\n    Jun 4\n    Monthly Totals\n    DATE\n    May 29\n    May 23\n    May 17\n    May 12\n    May 7\n    May 1\n    Monthly Totals\n    DATE\n    Apr 26\n    Apr 21\n    Apr 15\n    Apr 9\n    Apr 4\n    Monthly Totals\n    \n\nHowever, all I wanted were the lines where `columns[0].text` holds an\nactual date like 'Apr 21'.\n\nReviewing how the dates were displayed, the format was basically '%b %d', i.e.\nMay 12, Jun 4, etc. This is great because it means I want 3 letters and then a\nspace and nothing else. 
Turns out, regular expressions are great for stuff\nlike this!\n\nAfter a bit of [Googling](https://www.google.com) I got what I was looking\nfor:\n\n    \n    \n    re.match('[a-zA-Z]{3}\\s', columns[0].text)\n    \n\nWith my regular expression in hand, I could just add an `if` in front and call it\ngood!\n\nThe only issue was that as I ran it in testing, I kept getting no return data.\nWhat I didn't realize is that `re.match` returns `None` when there is no match. Enter more\nGoogling, after which I added `is not None` to the `if`, which led to the\nresults I wanted:\n\n    \n    \n    Oct 22\n    Oct 16\n    Oct 13\n    Oct 11\n    Oct 7\n    Oct 1\n    Sep 24\n    Sep 19\n    Sep 14\n    Sep 9\n    Jun 26\n    Jun 20\n    Jun 15\n    Jun 10\n    Jun 4\n    May 29\n    May 23\n    May 17\n    May 12\n    May 7\n    May 1\n    Apr 26\n    Apr 21\n    Apr 15\n    Apr 9\n    Apr 4\n    \n\nThe next part of the transformation is to convert to a date so I can sort on\nit (and display it properly) later.\n\nWith all of the data I need, I put the columns into a `Dictionary`:\n\n    \n    \n    dic = {'date': date_pitched, 'Full_IP': full_ip, 'Partial_IP': part_ip, 'ER': earned_runs}\n    \n\nand then into a `DataFrame`:\n\n    \n    \n    games = pd.DataFrame(dic)\n    \n\nand apply some manipulations to the `DataFrame`:\n\n    \n    \n    games = games.sort_values(['date'], ascending=[True])\n    games[['Full_IP','Partial_IP', 'ER']] = games[['Full_IP','Partial_IP', 'ER']].apply(pd.to_numeric)\n    \n\nNow to apply some baseball math to get the Earned Run Average:\n\n    \n    \n    games['IP'] = games.Full_IP + games.Partial_IP/3\n    games['GERA'] = games.ER/games.IP*9\n    games['CIP'] = games.IP.cumsum()\n    games['CER'] = games.ER.cumsum()\n    games['ERA'] = games.CER/games.CIP*9\n    \n\nIn the next part of Dr. Drang's post he writes a custom function to help\ncreate moving averages. 
It looks like this:\n\n    \n    \n    def rera(games, row):\n        if row.name+1 < games:\n            ip = df.IP[:row.name+1].sum()\n            er = df.ER[:row.name+1].sum()\n        else:\n            ip = df.IP[row.name+1-games:row.name+1].sum()\n            er = df.ER[row.name+1-games:row.name+1].sum()\n        return er/ip*9\n    \n\nThe only problem with it is I called my `DataFrame` `games`, not `df`. Simple\nenough, I'll just replace `df` with `games` and call it a day, right? Nope:\n\n    \n    \n    def rera(games, row):\n        if row.name+1 < games:\n            ip = games.IP[:row.name+1].sum()\n            er = games.ER[:row.name+1].sum()\n        else:\n            ip = games.IP[row.name+1-games:row.name+1].sum()\n            er = games.ER[row.name+1-games:row.name+1].sum()\n        return er/ip*9\n    \n\nWhen I try to run the code I get errors. Lots of them. This is because while I\nmade sure to update the `DataFrame` name to be correct, I overlooked that the\nfunction was using a parameter called `games`, and `Python` got a bit confused\nabout what was what.\n\nOK, round two, replace the parameter `games` with `games_t`:\n\n    \n    \n    def rera(games_t, row):\n        if row.name+1 < games_t:\n            ip = games.IP[:row.name+1].sum()\n            er = games.ER[:row.name+1].sum()\n        else:\n            ip = games.IP[row.name+1-games_t:row.name+1].sum()\n            er = games.ER[row.name+1-games_t:row.name+1].sum()\n        return er/ip*9\n    \n\nNo more errors! 
Now we calculate the 3- and 4-game moving averages:\n\n    \n    \n    era4 = partial(rera, 4)\n    era3 = partial(rera, 3)\n    \n\nand then add them to the `DataFrame`:\n\n    \n    \n    games['ERA4'] = games.apply(era4, axis=1)\n    games['ERA3'] = games.apply(era3, axis=1)\n    \n\nAnd print out a pretty graph:\n\n    \n    \n    plt.plot_date(games.date, games.ERA3, '-b', lw=2)\n    plt.plot_date(games.date, games.ERA4, '-r', lw=2)\n    plt.plot_date(games.date, games.GERA, '.k', ms=10)\n    plt.plot_date(games.date, games.ERA, '--k', lw=2)\n    plt.show()\n    \n\nDr. Drang focused on Jake Arrieta (he is a Chicago guy after all), but I\nthought it would be interesting to look at the graphs for Arrieta and the top 5\nfinishers in the NL Cy Young voting (because Clayton Kershaw was 5th place and\nI'm a Dodgers guy).\n\nHere is the graph for [Jake\nArrieta](http://www.espn.com/mlb/player/gamelog/_/id/30145/jake-arrieta):\n\n![Jake Arrieta](/images/uploads/2016/11/arrieta-300x222.png)\n\nAnd here are the graphs for the top 5 finishers in ascending order in the\n[2016 NL Cy Young voting](http://bbwaa.com/16-nl-cy/):\n\n[Max Scherzer](http://www.espn.com/mlb/player/gamelog/_/id/28976/max-scherzer),\nwinner of the 2016 NL [Cy Young\nAward](https://en.wikipedia.org/wiki/Cy_Young_Award) ![Max\nScherzer](/images/uploads/2016/11/scherzer-300x229.png)\n\n[Jon Lester](http://www.espn.com/mlb/player/gamelog/_/id/28487/jon-lester)\n![Jon Lester](/images/uploads/2016/11/lester-300x223.png)\n\n[Kyle Hendricks](http://www.espn.com/mlb/player/gamelog/_/id/33173/kyle-hendricks)\n![Kyle Hendricks](/images/uploads/2016/11/hendricks-300x225.png)\n\n[Madison Bumgarner](http://www.espn.com/mlb/player/gamelog/_/id/29949/madison-bumgarner)\n![Madison Bumgarner](/images/uploads/2016/11/bumgarner-300x232.png)\n\n[Clayton Kershaw](http://www.espn.com/mlb/player/gamelog/_/id/28963/clayton-kershaw):\n\n![Clayton Kershaw](/images/uploads/2016/11/kershaw-300x232.png)\n\nI've not spent much 
time analyzing the data, but I'm sure that it says\n_something_. At the very least, it got me to wonder, 'How many 0 ER games did\neach pitcher pitch?'\n\nI also noticed that the stats include the playoffs (which I wasn't intending).\nAnother thing to look at later.\n\nLegend:\n\n  * Black Dot - Single-game ERA on the date of the game\n  * Black Dashed Line - Cumulative ERA\n  * Blue Solid Line - 3-game trailing average ERA\n  * Red Solid Line - 4-game trailing average ERA\n\nFull code can be found on my [GitHub repo](https://www.github.com/miloardot).\n\n", "2016-11-21", "pitching-stats-and-python", "I'm an avid [Twitter](https://www.twitter.com) user, mostly as a replacement\n[RSS](https://en.wikipedia.org/wiki/RSS) feeder, but also because I can't\nstand [Facebook](https://www.facebook.com) and this allows me to learn about\nreally important world events when I need to and to just stay isolated with\n[my head in the\nsand](http://gerdleonhard.typepad.com/.a/6a00d8341c59be53ef013488b614d8970c-800wi)\nwhen I don't. It's perfect for \u2026\n\n", "Pitching Stats and Python", "https://www.ryancheley.com/2016/11/21/pitching-stats-and-python/"]], "truncated": false, "filtered_table_rows_count": 1, "expanded_columns": [], "expandable_columns": [], "columns": ["author", "category", "content", "published_date", "slug", "summary", "title", "url"], "primary_keys": ["slug"], "units": {}, "query": {"sql": "select author, category, content, published_date, slug, summary, title, url from content where \"published_date\" = :p0 order by slug limit 101", "params": {"p0": "2016-11-21"}}, "facet_results": {}, "suggested_facets": [{"name": "published_date", "type": "date", "toggle_url": "http://search.ryancheley.com/pelican/content.json?published_date=2016-11-21&_facet_date=published_date"}], "next": null, "next_url": null, "private": false, "allow_execute_sql": true, "query_ms": 12.854119297116995}