Installing Stata with Windows on Amazon EC2

The following is a simple guide to installing Stata on Windows using Amazon EC2, which I created to help out fellow UC Berkeley researchers who have stopped by D-Lab. I was not able to find anything similar on the web, so I hope this can help people try out EC2 and provide a low-effort way to gain capacity beyond a single laptop. Of course using Linux with R or Python is also a good approach, but plenty of guides for that already exist (I use the Berkeley Common Environment AMI personally). Any questions or comments would be greatly appreciated!

  1. Create an Amazon AWS account.
    • Academics can sign up for extra educational credits through the AWS Educate program.
    • You may want to consult IT services at your institution to determine if discounts or other negotiations are in place with Amazon.
    • AWS in Education Grants are worth considering as well if you are at an educational institution.
    • Note: it can take two hours or so for Amazon to activate a new AWS account, so you may need to take a break before you are able to create a new compute instance.
  2. Log in to AWS and click the “Sign in to the Console” button.
  3. Click EC2 in the upper left corner. It should be the first service in the list.
  4. Click “Launch Instance”.
  5. Scroll down the Quick Start list to choose your AMI. An easy choice is “Microsoft Windows Server 2012 R2 Base” with 64-bit architecture. The logo will say “free tier eligible.” Click the blue “Select” button.
  6. Choose the instance type based on how powerful a server you need. You can review the pricing here (be sure to click the “Windows” pricing tab and select the region you intend to use).
    • One of the “memory optimized” r3 options or “compute optimized” c3/c4 options is often good for data analysis. For large datasets you can choose “r3.4xlarge”, which currently boasts 16 virtual CPUs and 122GB of RAM and costs about $2 per hour.
    • Otherwise choose a server that has sufficient RAM to load your desired datasets, plus some breathing room (at least 2GB, but preferably 4GB+), and a good number of virtual CPUs if you have Stata MP.
      • Hard drive size is usually not that important because you can always create a separate EBS volume (think external hard drive) and attach it to your instance for extra storage.
  7. Click the blue button “Review and Launch.”
  8. Click “Edit security groups” - currently the third link on the right side of the page.
  9. Select “Create a new security group” - this should be selected by default.
  10. The default rule should be “RDP”. Go to the “Source” box and change the selection from “Anywhere” to “My IP”.
    • That allows you to connect remotely from your current IP and nowhere else, which drastically improves security for the EC2 instance.
    • Alternatively you can select “Custom IP” and list all IPs that you would use to connect from, with a comma between each IP.
    • If you change IPs you won’t be able to connect to the instance, so you will want to be able to connect from one or two relatively stable IPs (like your home and work IPs).
    • Or you can risk it and leave the original “Anywhere” setting, which does not restrict the IPs that are allowed to connect to the instance.
  11. Change the security group name to “Stata and Windows” or similar, and customize the description if you like.
  12. Click “Review and Launch” then “Launch”.
  13. Create a security key pair if you have not already done so.
    • Click “create a new key pair”
    • Name it “Stata and Windows key” or similar.
    • Click “download key pair” and save it to a folder that you will remember, like “Documents / Amazon EC2 / Key pairs”.
  14. Click “Launch Instances”
  15. Click “View instances”
  16. Find the new instance that you just launched, which will initially say “pending” in the “Instance State” column.
  17. Move your mouse to the blank space under the name column, click the pencil icon to edit the name, and name the instance something like “Stata and Windows”.
  18. In a minute or two the Instance State column will change to green “running”.
  19. Once it is running, right click the “running” text and select “Get Windows Password”.
    • Initially it will say “Password not available yet”
    • Wait another minute and click the “Try again” orange link.
  20. Continue this until the instance has booted up and you can retrieve the password.
    • Next to “Key Pair Path” click the “choose file” button and navigate to the folder where you saved your key pair. Select that file.
    • Click “Decrypt Password”
    • Select the Public IP, username, and password text and paste them into a text file so that we can use them in a minute.
    • Select the password text and copy it to your clipboard so that you can paste it later.
    • Click the “Close” button to close that dialog box.
  21. Right click the “running” text again and select “Connect”
  22. Click “Download Remote Desktop File”
  23. On your computer, open the Microsoft Remote Desktop application.
    • If you're on OSX, you might also try the CoRD remote desktop application for connecting, which some people prefer to Microsoft Remote Desktop. In my experience it doesn't open the .rdp file from Amazon automatically so you have to add the connection manually (IP address, username).
  24. Click File -> Import, and select the RDP file that you just downloaded.
  25. You will see a new Desktop option in Microsoft Remote Desktop with the IP of the Amazon server.
    • Right click the IP and select “Edit”
    • Put a name into the “Connection Name” box, like “Windows on Amazon EC2”
    • IMPORTANT: whenever you start up your instance again in the future it will have a different IP address, assigned by Amazon. You will need to edit the "PC name" field here and paste in the updated IP address (or public DNS name; either works) so that you can connect. Alternatively you can re-download the remote desktop file from Amazon, which will have the updated IP address, and re-add it to Microsoft Remote Desktop, but that takes more configuration.
    • Click the “Redirection” tab then click the “+” icon at the bottom to add folders on your local computer that you want to be able to access on your Amazon instance.
    • Even though you will be able to “see” these folders in the Amazon Windows instance, if you want to use or copy those files on the instance they will be transferring across the internet from your local computer to Amazon. As a result any large files will be extremely slow to access until they have been fully copied from your local computer to the cloud instance.
    • Click the red “x” to close the dialog.
  26. Double-click the IP address of your remote instance to connect to it.
  27. Once you get to the login screen, paste in your password. Depending on your remote desktop client, you may need to type it out manually.
  28. You may want to change the password on the instance to something that you will remember.
    • Click the Windows icon on the left side of the bottom task menu.
    • Click Control Panel then under “User Accounts” click Change Account Type
    • Click the Administrator account, then “Change the password”
    • Enter the default password from Amazon, then set a new password that you will remember.
  29. You may want to disable Internet Explorer Enhanced Security.
    • Click the “Server Manager” icon on the bottom taskbar (2nd icon from the left).
    • Click “Local Server” on the left side list.
    • On the right side column of items, click “On” next to “IE Enhanced Security Configuration”
    • Select “Off” for Administrators.
    • This will allow you to download executable files, such as the Stata installation program.
  30. Install Stata
    • You will need to copy the installation executable file to the Amazon server. For Stata 13 it is named "SetupStata13.exe". You could do this by uploading the file to Dropbox or by copying it to a folder on your laptop and then sharing that folder with the Amazon instance.
    • After Stata is installed you will need to enter the Serial number, Code, and Authorization from your Stata license.
  31. Start analyzing!
  32. When you are done, you can go back to your Amazon console, right click the instance, and select “Instance State -> Stop” to stop the instance.
    • At this point you will no longer be paying for compute time, but you will be paying a slight amount for the hard drive to stay available.
  33. You can then right click and select “Image -> Create Image” to create a backup of your EC2 cloud computer.
    • In the image name box, name it something like "Windows Stata April 28" (or whatever date it is for you).
    • Click "Create Image".
  34. Click the AMIs section (left menu) and wait until the image has been created.
    • The image can then be used the next time you launch an instance. Instead of selecting the “Microsoft Windows Server 2012 R2 Base” you would click to “My AMIs” and choose this AMI out of the list.
      • This allows you to have all of your software and data ready to go within minutes.
    • You will then be charged a small amount per month based on the size of the image.
  35. Then go back to the instance, right click, and select “Instance State -> Terminate” to remove the instance.
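As a sanity check on the instance sizing advice in step 6, here is a quick back-of-envelope calculation. The dataset shape is hypothetical, and I'm assuming 8 bytes per numeric value; swap in your own numbers.

```python
# Rough RAM estimate for loading a dataset, assuming 8 bytes per
# numeric value. The dataset shape below is made up for illustration.
rows, cols = 50_000_000, 30
dataset_gb = rows * cols * 8 / 1e9   # bytes -> GB, here 12.0
headroom_gb = 4                      # breathing room, per step 6
needed_gb = dataset_gb + headroom_gb

print(f"Need roughly {needed_gb:.0f} GB of RAM")  # prints: Need roughly 16 GB of RAM
assert needed_gb < 122  # fits comfortably on an r3.4xlarge (122 GB)
```

If the estimate lands near an instance type's ceiling, go one size up rather than risk swapping to disk mid-analysis.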


After helping a few people figure out errors in their setup, I am starting a troubleshooting section below:

  1. Can't connect to the server?
    • Check your wireless network. If you are connected to a "visitor" or "guest" network it may not allow Remote Desktop connections. Make sure that you are using the fully authorized wireless network.

Stumbling blocks, questions, or other suggestions to improve the guide? Please post in the comments. I have only been able to try this out from my Mac laptop, so if there are any improvements to the instructions that could be made for Windows users I'd be glad to incorporate those. I'm also planning to add in some screenshots and/or a screencast when I have a chance.

Advice for admitted political science PhD students

As the next round of admitted political science PhD students starts thinking about what they should be doing to prepare for the fall, I thought I would write down a few of my own reflections on what would have been best in retrospect. Granted, this is only based on my past ~0.6 years of PhD life, but some may find it helpful in thinking through their own preparation plan of attack.

1. Expand your studies of interesting faculty at each admitted institution - read their CVs, skim their major papers, and review what classes they teach. Make notes of who is most appealing and why. You already did this once for your applications but a second go-around is worthwhile.

2. Review the grad handbooks and understand the formal degree requirements at each institution. If needed confirm the rules with the subfield chairs, because written policies can be outdated.

3. If you are gainfully employed, start shifting your weekly expenditures to savings mode - you are currently wealthier than you will be for a very long time. In particular, try to save enough to pay for your living expenses for one semester - you could partially self-fund so that you don’t have to TA a class.

4. Start thinking about how you can hedge your bets if an academic career doesn’t pan out. Plan your courses and skill development so that you are able to get a good job in industry if need be.

5. Develop a draft course schedule for the next 2 years, then discuss with current grad students. In particular, get their advice on workload, field exam preparation, and relevant courses outside of the department.

6. Work through Moore and Siegel’s “A Math Course for Political and Social Scientists”. One chapter a week is a reasonable pace. Write up your answers in LaTeX using RStudio+knitr - here is an example template. In RStudio do File -> New -> R Sweave to start a new file, then check your RStudio options under Sweave and make sure that knitr is selected rather than Sweave.

7. Get the syllabi for your substantive field seminars from current students and buy the required books. Try to read 5 of the books in advance, taking notes along the way. If you can generate a one-page summary of each that would be ideal.

8. Start thinking about your NSF application - review the guidelines, deadlines, and set out a project schedule so that you get an application submitted in your first year. Ask NSF winners if you can take a peek at the materials they submitted.

9. Set up meetings with your current colleagues to discuss 1) research ideas, and 2) consulting opportunities - you would be surprised how many people will meet for coffee if you ask. Try to get 2 consulting projects confirmed before you leave for grad school.

10. Work through Chris Paciorek’s R Bootcamp (or a similar intro), unless you are already very comfortable with R. Again, use LaTeX via RStudio+knitr to write up any responses so that you can hit the ground running.

11. Take at least two weeks off before you need to move out for math camp. Go hiking, play guitar, veg out, read some fiction, get rid of junk, and start packing.

Even doing a few items from the list would be great. I'm sure others have their own advice (any comments are welcome!), and these suggestions are assuredly biased by my personal interest in methodology. Nevertheless, food for thought.

New Timelapse: A Daily Ballet, Subtly Seen

My latest timelapse video, glimpsing the daily ballet which surrounds us. This project encompasses 911 photos taken over 31 hours, one photo every 2 minutes, taken May 26, 2013 - May 27, 2013 in Washington DC.

Best viewed in HD with screen and volume maxed.

A Quick Case For Vitamin Supplementation in Women

Editor: The following is an email from February 2012 that I wrote to a few friends after a discussion about the need for multivitamins. I have been meaning to post it for a while and the recent new research on vitamins is a great excuse to do so. The recent research does not address the primary rationale for taking a multivitamin, which is as a nutritional "insurance policy", not as a way to reduce risk of death or cancer. I do agree that taking individual vitamins or supplements without a medical justification (i.e. deficiency) is unwise. One friend was later diagnosed as deficient in Vitamin D by her doctor.

I think there are four nutrients with related research which suggests a particularly good medical rationale to take a daily multivitamin, even among those whose diet is very healthy and are asymptomatic. Just posting some quick quotes and links here rather than editorializing:

1. Folate (B9)

  • RDA(^) for women age 19-30: 320mcg. Amount provided in a multivitamin(+): 400mcg. Tolerable upper limit(*): 1000mcg.
  • "In view of evidence linking folate intake with neural tube defects in the fetus, it is recommended that all women capable of becoming pregnant consume 400 μg [mcg] from supplements or fortified foods in addition to intake of food folate from a varied diet." [1, page 2]
  • "Timing of folate is critical: For folate to be effective, it must be taken in the first few weeks after conception, often before a woman knows she is pregnant." [2]
  • "Those who drink may benefit the most from getting extra folate, since alcohol moderately depletes our body’s stores." [3]

2. Vitamin B12 (if you are not eating meat regularly)

  • RDA for women age 19-30: 2.4mcg. Amount provided in a multivitamin: 6mcg. Tolerable upper limit: unknown due to lack of adverse effects.
  • "Some people who eat little or no animal foods such as vegetarians and vegans. Only animal foods have vitamin B12 naturally. When pregnant women and women who breastfeed their babies are strict vegetarians or vegans, their babies might also not get enough vitamin B12." [4]
  • "But even vegetarians who eat eggs and dairy products consume, on average, less than half the adult Recommended Dietary Allowance of 2.4 mcg of B12, notes the Health Letter."… "The Harvard Health Letter recommends that vegetarians and older people with atrophic gastritis take a multivitamin, eat fortified breakfast cereal, or both." [5]
  • "It is prudent to advise all vegetarian and vegan patients, particularly if they are elderly or anticipating a pregnancy, to consume synthetic cobalamin daily, either by taking a supplement containing vitamin B12 or eating a serving of vitamin B12–fortified grain products." [10]

3. Iron

  • RDA for women age 19-30: 18mg [for men it's only 8mg]. Amount provided in a multivitamin: 18mg. Tolerable upper limit: 45mg.
  • "Iron deficiency is the most common nutritional disorder affecting about 20-25% of the world's population, predominantly children and women. There is emerging evidence that depletion of iron stores may have adverse consequences for adults even in the absence of anaemia."
  • 9% of US women age 20-49 have iron deficiency. [9]
  • "Iron from meat, poultry, and fish (i.e., heme iron) is absorbed two to three times more efficiently than iron from plants (i.e., non-heme iron)." [11]

4. Vitamin D

  • RDA for women age 19-30: 15mcg (600 IU). Amount provided in a multivitamin: 1000 IU (25mcg). Tolerable upper limit: 100mcg (4000 IU).
  • "In adults, vitamin D deficiency leads to osteomalacia, causing bone pain and muscle weakness." [8]
  • Note: There is not consistent evidence of a link between Vitamin D and MS. [7, pages 173-174]
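As a quick sanity check on the figures quoted above, here is the arithmetic comparing the reference multivitamin's doses to the RDAs and tolerable upper limits. The values are copied straight from the bullets (women age 19-30), with units matching within each row.

```python
# (RDA, multivitamin dose, tolerable upper limit), using the figures
# quoted above; units are consistent within each row as listed.
nutrients = {
    "folate (mcg)":    (320, 400, 1000),
    "B12 (mcg)":       (2.4, 6, None),   # no UL established
    "iron (mg)":       (18, 18, 45),
    "vitamin D (mcg)": (15, 25, 100),
}
for name, (rda, dose, ul) in nutrients.items():
    pct_of_rda = 100 * dose / rda
    within_ul = ul is None or dose <= ul
    print(f"{name}: {pct_of_rda:.0f}% of RDA, within UL: {within_ul}")
```

Each dose meets or exceeds the RDA while staying well under the UL, which is the "insurance policy" logic in a nutshell.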

So, that is my rushed personal case for a generally recommended daily multivitamin for women. It was fun to research since I was analyzing the nutrients of my own multivitamin earlier today anyway. There are lots of good multivitamin summary articles as well, such as -- thinking of a multivitamin as a "nutrition insurance policy". But at the end of the day you're only as healthy as you feel, as they say in Taxi Driver, and I am the one who's half-deaf [I was suffering from mold allergies at the time].

^ "Recommended Dietary Allowance": Meets or exceeds the daily dietary requirements for 97.5% of the US population.
+ Reference multivitamin: Bayer One-A-Day Women's Multivitamin.
* "Tolerable upper limit": A Tolerable Upper Intake Level (UL) is the highest level of daily nutrient intake that is likely to pose no risk of adverse health effects to almost all individuals in the general population. (Dietary Reference Intakes (DRIs): Estimated Average Requirements. Food and Nutrition Board, Institute of Medicine, National Academies)

1. Recommended Intakes for Individuals --
7. Dietary Reference Intakes for Calcium and Vitamin D, IOM 2010.

Text Message Experiments in 2008

A brief summary presentation of my mobile research at Rock the Vote, as presented to the Analyst Group in Nov. 2008.

Congressional Redistricting Reform in Texas (2007)


The following is my professional report for my master's degree in public affairs, which I wrote somewhat hastily in the spring of 2007. Please excuse the many typos and generally rushed writing. My first reader was Professor (& current Austin city councilman) Bill Spelman and my second reader was former state representative Sherri Greenberg.


Congressional redistricting occurs across the states after every decennial census, and is typically a fierce process marked by partisan strategizing, cries of gerrymandering, and endless court battles. Single-member districts are drawn to be contiguous, equipopulous, and compliant with the Voting Rights Act; states also attempt to make compact districts, align them with political subdivisions, and maintain communities of interest. A few states seek to create competitive districts that reduce the incumbent’s advantage. Redistricting law has changed markedly in the past century, beginning with the enforcement of equal population in the 1960s; since then emphasis has shifted to racial and partisan gerrymandering. One innovative technique to reduce gerrymandering is the independent redistricting commission; these commissions are reviewed for the six states that use them for congressional districts. Based on this comparative analysis, recent attempts in Texas to reform redistricting do not appear strong enough. Public input into the redistricting process should be strengthened and the current proposal to give undue influence to rural interests should be modified. But given the difficulty of enacting redistricting reform, more conservative alternatives should also be considered: a constitutional amendment prohibiting mid-decade redistricting, federal legislation on electoral procedures, and multi-member districts with proportional representation.

Table of Contents

  • Chapter 1. Principles of Redistricting - page 1
  • Chapter 2. Redistricting Law and Precedent - page 17
  • Chapter 3. The Several States - page 32
  • Chapter 4. Policy Recommendations - page 46
  • Bibliography - page 56
  • Vita - page 58

Logic Map of New Media

During my flight from DC to Texas today I finally got around to creating my first draft of a logic map (or model) for new media in a political advocacy context - a simpler abridged version has been sketched out on my office whiteboard for a few months now. See below:

Logic Map of New Media

My goal here is to lay out the interactions and chain of events that lead a visitor to a new media campaign, and then to document the primary engagement methods once they're in new media land. I didn't include every possible linkage arrow, and perhaps need to cut some existing low-priority ones to clean it up. The extent to which I go into other departments, and the general level of detail, is also somewhat arbitrary.

I used ArgoUML to create the model - it was a free download, though it seems to be aimed more at developers, so please excuse the ugliness. I had wanted to use OmniGraffle but haven't purchased it yet, and I waited until after my trial license (only 14 days??) had expired to begin this project. Working in 2D space was limiting: I would have preferred three or more dimensions to help organize & separate the different units.

What do you think - any missing areas, disagreements, or other thoughts? Have people used logic models for other projects? I plan to continue improving this so would be glad to get feedback. I would like to help develop stronger theoretical infrastructure for new media, and I think logic models are one good approach to support more rigorous program management & evaluation.

Voter Registration and Turnout by Age in U.S. Presidential Elections, 1996-2008

Voter Registration and Turnout by Age in U.S. Presidential Elections, 1996-2008

This chart takes a look at voter registration, turnout, and turnout-of-registered trends across the past four presidential elections in the United States: 1996, 2000, 2004, and 2008. The data come from the Current Population Survey and are smoothed to reduce the variability across age due to survey sampling. My previous version showed voter registration and turnout for 1996-2004, so this version has the added comparison of the 2008 elections.

A few things that pop out:

  • Voter registration of newly eligible voters (18-20 years old) fell slightly in 2008 compared to 2004, and is very slightly higher for the rest of the youth cohort (21-29).
  • Those older than 30 have a drop in registration compared to 2004, falling back to 2000 levels.
  • Turnout for youth is similar for the 18-20 year old group with small increases for the 21-29 repeat voters compared to 2004; the huge increases compared to 1996/2000 are maintained all the way up to age 40. Older voters have the same or lower levels of turnout compared to 2004.
  • At the top, turnout of those registered was the highest for young people in 2008. This suggests that GOTV to young people has continued to improve, particularly if we look back to the much lower curves for 1996 and 2000.

What other thoughts or interpretations do people have? And what do you think about the chart - any ideas for improvement? The main thing I would like to add is an additional table or bar chart showing the distribution of the population across age groups, which will help with gauging each age cohort's relative importance in terms of total votes.

Finally, I have attached the Excel file with the data and the chart itself, so feel free to download it and play around with things. It also includes an alternative black-background version which should be better for projection and makes the colors pop - it was inspired by Al Gore's climate change slides, which were in the background as I updated this chart.

Turnout in student elections at The University of Texas

The student elections were last week at The University of Texas - congrats to Liam, et al. The focus was on Student Government, but the elections also included Texas Student Media, the Graduate Student Assembly, and others. Overall turnout was 10,018 votes (PDF), or 20% of the student body based on Fall 2008 enrollment figures. Spring enrollment figures are not yet available, but the actual student body in the spring semester tends to be slightly smaller than in the fall due to drop-outs and transfers, so once that data is released the estimated turnout will likely be 0.5 to 1.5 percentage points higher.
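The arithmetic behind that estimate can be sketched quickly. The fall enrollment of roughly 50,000 is implied by the 20% figure; the 3% spring drop below is a hypothetical illustration, not the actual enrollment change.

```python
# Turnout arithmetic: fall enrollment of ~50,000 is implied by the 20%
# figure above; the 3% spring drop is a hypothetical illustration.
votes = 10_018
fall_enrollment = 50_000
spring_enrollment = int(fall_enrollment * 0.97)   # drop-outs/transfers

fall_turnout = 100 * votes / fall_enrollment      # about 20.0%
spring_turnout = 100 * votes / spring_enrollment  # about 20.7%
bump = spring_turnout - fall_turnout              # lands in the 0.5-1.5 range
```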

Here is a long-term graph of turnout:

UT Student Government Election Turnout

It's pretty clear that 1) online voting increases turnout, 2) a larger student body tends to have lower turnout, 3) while recent turnout is relatively high, we haven't gotten close to pre-1970s turnout. Thoughts?

Use Python + Gmail to connect your email list to Twitter, Flickr, and everything

Online social media is growing in complexity

The role of online social media continues to grow in importance for political organizations. Facebook, MySpace, YouTube, Flickr, Twitter, and dozens of other sites need to be managed, tracked, and integrated. This takes a lot of staff time, but luckily these services continue to expand their support for external programs that can manipulate data or perform actions. APIs (application programming interfaces) allow different websites or computers to talk to each other and do things in the background. Even sites which don't provide official APIs can often be automated using third-party tools.

Loading emails into Gmail contacts fosters integration

In this article I'll show one simple script that uses Python and a third-party library (libgmail) to load a file of emails into Gmail contacts. It's very common for web applications to support importing contacts from Gmail accounts, so once your emails are in Gmail it's simple to import them into Flickr, Twitter, etc.

The benefits of data integration

Why do this? Well, for one, it helps organizations bootstrap their social media relationships. Normally your email list will be the largest medium for contacting members, so syncing that list into other services can really jump-start your campaigns there. Rather than start from scratch you can build off your existing members, connect with them in different ways, and expand your online communication offerings. Even better, integrate social network synchronization into your data workflow: e.g. on a nightly (or hourly) basis, sync new email subscribers into Gmail, then sign into your socnets and add any new accounts you find. The "add any new accounts" step can be automated as well and will be explored in a later article.

Running the code

Python is the scripting glue, and I use libgmail to interface with Gmail. libgmail requires the mechanize module, but if you have easy_install that's a single command: sudo easy_install mechanize (I'm on OS X). Then download libgmail and put it somewhere Python can find it.

It will take a while to load the emails into Gmail, but once they are there it's just a few clicks to add all your members on Flickr, Twitter, or any other service that can import from Gmail contacts!

Following our volunteers on Twitter

I tested this concept by loading 5,215 volunteer records into Gmail. Of those, 590 could be found on Twitter (an 11.3% match rate) - not too shabby. And because the contacts are stored in Gmail, I can always import again, say once a month, and add members who have recently joined. It would be interesting to see people try this for their own members and see what the match rates are for different services and organizations.

Final notes

  • Create a new Gmail account specifically for this data sync - don't use an existing account.
  • Every 2,000 records or so Gmail will stop processing records and the script will abort. Just run it again and it will resume where it left off.
  • Twitter seems to only add up to 200 followers at a time, but you can repeat as needed.
  • Because it is slow, this method works best as part of an incremental workflow on new records. For loading existing email records it may be easier to use Gmail's import-contacts-via-CSV feature. The problem with the CSV method is that it's limited to 3,000 contacts per file, so to handle more than a few thousand records you'll want an automated solution.
  • In theory, once you import email records into a service, you will want to capture the respective username & id for that email record. Then you can integrate that into your main database and use it for future messaging (potentially via that service's API). This would be automated.
  • Feel free to fork the script and make improvements - it's easy to do on GitHub. Or post feedback here.
  • Tip of the hat to KF for thinking of an email -> Twitter data sync in the first place. Good call!
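The 3,000-contact limit on CSV imports mentioned above is easy to script around: split the export into chunks before importing. A minimal sketch - the helper names and file-name prefix here are my own, not from the original script:

```python
import csv

# Work around Gmail's 3,000-contacts-per-file CSV import limit by
# splitting an export into chunks. Names here are hypothetical.

def split_contacts(rows, chunk_size=3000):
    """Break a list of contact rows into import-sized chunks."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

def write_chunks(rows, header, prefix="contacts_part"):
    """Write each chunk to its own CSV file, e.g. contacts_part_1.csv."""
    for n, chunk in enumerate(split_contacts(rows), start=1):
        with open(f"{prefix}_{n}.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(chunk)
```

Each output file then goes through Gmail's normal CSV import, one at a time.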

And here's the code
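The original script isn't reproduced here, but below is a minimal sketch of its shape. The file names (emails.txt, synced.txt) are my own convention, and the addContact call stands in for libgmail's contact API, which I haven't verified - treat the login/upload section as illustrative. The parsing and resume logic is the part sketched faithfully.

```python
# Sketch of the loader's overall shape (not the original script).
# Assumptions: emails.txt holds one address per line, and synced.txt
# records addresses already uploaded, so a re-run resumes where it
# left off after Gmail's ~2,000-record cutoff.

def parse_emails(text):
    """One address per line; skip blanks and case-insensitive duplicates."""
    seen, out = set(), []
    for line in text.splitlines():
        addr = line.strip().lower()
        if addr and "@" in addr and addr not in seen:
            seen.add(addr)
            out.append(addr)
    return out

def pending(emails, synced):
    """Return addresses not yet uploaded, in order - the resume logic."""
    done = set(a.strip().lower() for a in synced)
    return [e for e in emails if e not in done]

def main():  # needs network access plus the libgmail + mechanize modules
    import libgmail
    account = libgmail.GmailAccount("", "password")
    account.login()
    emails = parse_emails(open("emails.txt").read())
    synced = open("synced.txt").read().splitlines()
    for addr in pending(emails, synced):
        account.addContact(addr)  # hypothetical stand-in; see lead-in
        with open("synced.txt", "a") as log:
            log.write(addr + "\n")
```

If Gmail cuts a run off around 2,000 records, running it again picks up from synced.txt, which is the resume behavior described in the final notes.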

