Friday, 29 March 2013

DATA ANALYSIS


When the domain from which data are harvested is a science or engineering field, the terms data processing and information systems are considered too broad, and the more specialized term data analysis is typically used. Data analysis, arguably a special kind of data processing, focuses on highly specialized and highly accurate algorithmic derivations and statistical calculations that are less often observed in the typical general business environment.
A divergence of culture between data processing in general and data analysis is exhibited in the numerical representations generally used: in data processing, measurements are typically stored as integer, fixed-point, or binary-coded decimal representations of numbers, whereas the majority of measurements in data analysis are stored as floating-point representations of rational numbers.
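This cultural divide can be seen in a short Python sketch (the values are illustrative): exact decimal arithmetic of the kind favored in business data processing versus the binary floating point typical of data analysis.

```python
from decimal import Decimal

# Fixed-point style storage, common in business data processing:
# Decimal arithmetic is exact for decimal fractions.
exact = Decimal("0.10") + Decimal("0.20")

# Floating-point storage, common in data analysis: fast and compact,
# but binary floating point cannot represent 0.1 or 0.2 exactly.
approx = 0.10 + 0.20

print(exact)           # 0.30
print(approx == 0.30)  # False, due to binary rounding
```

Floating point trades exactness of decimal fractions for speed and range, which suits statistical calculation; exact decimal suits currency and record-keeping.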
For data analysis, packages such as DAP, gretl, or PSPP are often used.
Data Processing
Basically, data are nothing but facts (organized or unorganized) which can be converted into other forms to make them useful, clear, and practical. This process of converting facts into information is processing. Practically all naturally occurring processes can be viewed as examples of data processing systems, in which "observable" information in the form of pressure, light, etc. is converted by human observers into electrical signals in the nervous system as the senses we recognize as touch, sound, and vision. Even the interaction of non-living systems may be viewed in this way as rudimentary information processing systems. Conventional usage of the terms data processing and information systems restricts their use to the algorithmic derivations, logical deductions, and statistical calculations that recur perennially in general business environments, rather than the more expansive sense of all conversions of real-world measurements into real-world information in, say, an organic biological system or even a scientific or engineering system.

Editing
Data have to be edited, especially when they relate to responses to open-ended questions of interviews and questionnaires, or unstructured observations. In other words, information that may have been noted down by the interviewer, observer, or researcher in a hurry must be clearly deciphered so that it may be coded systematically in its entirety. Lack of clarity at this stage will result later in confusion. The edited data should be identifiable through the use of a different color pencil or ink so that original information is still available in case of further doubts.
Incoming mailed questionnaire data have to be checked for incompleteness and inconsistencies, if any, by designated members of the research staff. Inconsistencies that can be logically corrected should be rectified and edited at this stage. Much of the editing is automatically taken care of in the case of computer-assisted telephone interviews and electronically administered questionnaires, even as the respondent is answering the questions.
Handling Blank Responses
Not all respondents answer every item in the questionnaire. Answers may have been left blank because the respondent did not understand the question, did not know the answer, was not willing to answer, or was simply indifferent to the need to respond to the entire questionnaire. If a substantial number of questions – say 25% of the items in the questionnaire – have been left unanswered, it may be a good idea to throw out the questionnaire and not include it in the data set for analysis. In this event, it is important to mention the number of returned but unused responses due to excessive missing data in the final report submitted to the sponsor of the study. If, however, only two or three items are left blank in a questionnaire with, say, 30 or more items, we need to decide how these blank responses are to be handled. One way to handle a blank response to an interval-scaled item with a midpoint is to assign the midpoint of the scale as the response to that particular item. An alternative is to allow the computer to ignore the blank responses when the analyses are done. There are several ways of handling blank responses; a common approach, however, is either to assign the midpoint of the scale as the value or to ignore the particular item during the analysis.
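As a rough sketch (the item values and the 5-point scale are hypothetical), the two approaches can be expressed as:

```python
# Hypothetical responses to a 5-point interval-scaled item;
# None marks a blank (missing) response.
responses = [4, None, 2, 5, None, 3]

SCALE_MIN, SCALE_MAX = 1, 5
midpoint = (SCALE_MIN + SCALE_MAX) / 2  # 3.0 on a 1-5 scale

# Approach 1: substitute the scale midpoint for each blank.
imputed = [midpoint if r is None else r for r in responses]

# Approach 2: ignore blanks and analyze only the answered items.
answered = [r for r in responses if r is not None]
mean_ignoring_blanks = sum(answered) / len(answered)
```

Statistical packages implement both options; the choice affects the effective sample size per item, so it should be reported either way.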
Coding
The next step is to code the responses. Scanner sheets can be used for collecting questionnaire data; such sheets facilitate the entry of the responses directly into the computer without manual keying in of the data. However, if for whatever reason this cannot be done, then it is perhaps better to use a coding sheet first to transcribe the data from the questionnaire and then key in the data. This method, in contrast to flipping through each questionnaire for each item, avoids confusion, especially when there are many questions and a large number of questionnaires.
It is possible to key in the data directly from the questionnaires, but that would need flipping through several questionnaires, page by page, resulting in possible errors and omissions of items. Transfer of the data first onto a code sheet would thus help.
Human errors can occur while coding. At least 10% of the coded questionnaires should therefore be checked for coding accuracy.
Their selection may follow a systematic sampling procedure. That is, every nth form coded could be verified for accuracy. If many errors are found in the sample, all items may have to be checked.
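The systematic check described above is easy to script; this sketch (the form IDs are invented) pulls every nth coded form for verification:

```python
def verification_sample(forms, n):
    """Return every nth form (the nth, 2nth, ...) for accuracy checking."""
    return forms[n - 1::n]

# 100 coded questionnaires with hypothetical IDs.
coded_forms = [f"Q{i:03d}" for i in range(1, 101)]

# Every 10th form gives the recommended 10% verification sample.
to_check = verification_sample(coded_forms, 10)
```

If too many coding errors turn up in this sample, the check is widened to all forms, as the text notes.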
Categorizing
At this point it is useful to set up a scheme for categorizing the variables such that the several items measuring a concept are all grouped together.
Responses to some of the negatively worded questions have also to be reversed so that all answers are in the same direction.
If the questions measuring a concept are not contiguous but scattered over various parts of the questionnaire, care has to be taken to include all the items without any omission or wrong inclusion.
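Reverse-scoring a negatively worded question is a single arithmetic step; this sketch assumes a 1-5 Likert-type scale:

```python
SCALE_MIN, SCALE_MAX = 1, 5

def reverse_score(value):
    """Flip a response on the scale: 1 <-> 5, 2 <-> 4, 3 stays 3."""
    return SCALE_MIN + SCALE_MAX - value

# Responses to a negatively worded item, brought into the same
# direction as the positively worded items measuring the concept.
raw = [1, 2, 3, 4, 5]
aligned = [reverse_score(v) for v in raw]
```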

Entering Data
If questionnaire data are not collected on scanner answer sheets, which can be directly entered into the computer as a data file, the raw data will have to be manually keyed into the computer. Raw data can be entered through any software program. For instance, in the SPSS Data Editor, which looks like a spreadsheet, each row represents a case and each column represents a variable. All missing values appear as a period (dot) in the cell. It is possible to add, change, or delete values after the data have been entered.
It is also easy to compute the new variables that have been categorized earlier, using the Compute dialog box, which opens when the Transform icon is chosen. Once the missing values, the recodes, and the computing of new variables are taken care of, the data are ready for analysis.
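The "computing of new variables" step amounts to deriving a composite score from the items measuring one concept. A minimal sketch (the item names and the "satisfaction" concept are hypothetical, not from the text):

```python
# One case (one questionnaire) keyed by item name; q3, q7, and q12
# are assumed to measure a single concept, say "satisfaction".
case = {"q3": 4, "q7": 2, "q12": 5}

items = ("q3", "q7", "q12")
case["satisfaction"] = sum(case[i] for i in items) / len(items)
```

This is what the SPSS Compute dialog does behind the scenes: a new column derived from existing ones by a stated formula.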
Interpretation of Data Analyzed
After the data have been completely analyzed, the results have to be properly interpreted. It is the interpretation of the results that is most meaningful to the organization.

Classification and Tabulation of Data in Research

Classification is the way of arranging data into different classes in order to give a definite form and a coherent structure to the data collected, facilitating their use in the most systematic and effective manner. It is the process of grouping statistical data under various understandable, homogeneous groups for the purpose of convenient interpretation. A uniformity of attributes is the basic criterion for classification, and the grouping of data is made according to similarity. Classification becomes necessary when there is diversity in the data collected, for meaningful presentation and analysis. For homogeneous data, however, classification may be unnecessary.

Objectives of classification of data:
  • To group heterogeneous data under homogeneous groups of common characteristics;
  • To facilitate the study of similarities among the various groups;
  • To facilitate effective comparison;
  • To present complex, haphazard, and scattered data in a concise, logical, homogeneous, and intelligible form;
  • To maintain clarity and simplicity of complex data;
  • To identify independent and dependent variables and establish their relationship;
  • To establish a cohesive nature for the diverse data for effective and logical analysis;
  • To make quantification logical and effective.
A good classification should have the characteristics of clarity, homogeneity, and equality of scale, purposefulness, accuracy, stability, flexibility, and unambiguity.
Classification is of two types, viz., quantitative classification, which is on the basis of variables or quantity, and qualitative classification (classification according to attributes). The former groups the variables into cohesive quantitative groups, while the latter groups the data on the basis of attributes or qualities. Again, classification may be multiple or dichotomous. The former makes many (more than two) groups on the basis of some quality or attribute, while the latter divides the data into two groups on the basis of the presence or absence of a certain quality. Grouping the workers of a factory under various income (class-interval) groups is a multiple classification; dividing them into skilled and unskilled workers is a dichotomous classification. The tabular form of such a classification is known as a statistical series, which may be inclusive or exclusive.

The classified data may be arranged in tabular form (tables) in columns and rows. Tabulation is the simplest way of arranging the data, so that anybody can understand it in the easiest way. It is the most systematic way of presenting numerical data in an easily understandable form. It facilitates a clear and simple presentation of the data, a clear expression of their implications, and an easier and more convenient comparison. Tables can be simple or complex, and general-purpose or summary tables. Classification and tabulation are interdependent events in research.
FREQUENCY DISTRIBUTIONS

INTRODUCTION

The next step after the completion of data collection is to organize the data into a meaningful form so that a trend, if any, emerging from the data can be seen easily. One of the common methods for organizing data is to construct a frequency distribution. A frequency distribution is an organized tabulation/graphical representation of the number of individuals in each category on the scale of measurement.[1] It allows the researcher to take in the entire data at a glance.
It shows whether the observations are high or low and also whether they are concentrated in one area or spread out across the entire scale.
Thus, frequency distribution presents a picture of how the individual observations are distributed in the measurement scale.

DISPLAYING FREQUENCY DISTRIBUTIONS

Frequency tables

A frequency (distribution) table shows the different measurement categories and the number of observations in each category.
Before constructing a frequency table, one should have an idea of the range (minimum and maximum values). The range is divided into arbitrary intervals called "class intervals." If the class intervals are too many, there will be no reduction in the bulkiness of the data, and minor deviations also become noticeable. On the other hand, if they are too few, the shape of the distribution itself cannot be determined. Generally, 6–14 intervals are adequate.

The width of the class can be determined by dividing the range of observations by the number of classes. The following are some guidelines regarding class widths:[1]
  • It is advisable to have equal class widths. Unequal class widths should be used only when large gaps exist in data.
  • The class intervals should be mutually exclusive and nonoverlapping.
  • Open-ended classes at the lower and upper side (e.g., <10, >100) should be avoided.
The frequency distribution table of the resting pulse rate in healthy individuals is given in Table 1. It also gives the cumulative and relative frequency that helps to interpret the data more easily.

Table 1

Frequency distribution of the resting pulse rate in healthy volunteers (N = 63)
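A table like Table 1 can be built mechanically once the class intervals are fixed. This sketch uses invented pulse-rate values (not the study data) to show how frequency, cumulative frequency, and relative frequency are derived:

```python
# Invented resting pulse rates (beats per minute) for illustration.
pulses = [62, 68, 71, 75, 78, 64, 69, 72, 81, 66, 73, 70]

# Mutually exclusive, non-overlapping classes: 60-69, 70-79, 80-89.
low, width, n_classes = 60, 10, 3
classes = [(low + k * width, low + (k + 1) * width - 1)
           for k in range(n_classes)]

table, cumulative = [], 0
for lo, hi in classes:
    freq = sum(lo <= p <= hi for p in pulses)   # count in this class
    cumulative += freq                           # running total
    relative = freq / len(pulses)                # proportion of all cases
    table.append((f"{lo}-{hi}", freq, cumulative, relative))
```

Each row carries the class label, its frequency, the cumulative frequency, and the relative frequency, which is exactly the layout of Table 1.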

Frequency distribution graphs

A frequency distribution graph is a diagrammatic illustration of the information in the frequency table.

Histogram

A histogram is a graphical representation with the variable of interest on the X axis and the number of observations (frequency) on the Y axis. Percentages can be used if the objective is to compare two histograms having different numbers of subjects. A histogram is used to depict the frequency when data are measured on an interval or a ratio scale. Figure 1 depicts a histogram constructed for the data given in Table 1.
Figure 1
Histogram of the resting pulse rate in healthy volunteers (N = 63)
A bar diagram and a histogram may look the same, but there are three important differences between them:
  • In a histogram, there is no gap between the bars, as the variable is continuous; a bar diagram has space between the bars.
  • All the bars need not be of equal width in a histogram (this depends on the class interval), whereas they are of equal width in a bar diagram.
  • In a histogram, the area of each bar corresponds to the frequency, whereas in a bar diagram it is the height [Figure 1].

Frequency polygon

A frequency polygon is constructed by connecting the midpoints of the tops of the bars in a histogram by straight lines, without displaying the bars. A frequency polygon aids in the easy comparison of two frequency distributions. When the total frequency is large and the class intervals are narrow, the frequency polygon becomes a smooth curve known as the frequency curve. A frequency polygon illustrating the data in Table 1 is shown in Figure 2.

Figure 2

Frequency polygon of the resting pulse rate in healthy volunteers (N = 63)

Box and whisker plot

This graph, first described by Tukey in 1977, can also be used to illustrate the distribution of data. There is a vertical or horizontal rectangle (box), the ends of which correspond to the upper and lower quartiles (75th and 25th percentile, respectively). Hence the middle 50% of observations are represented by the box.
The length of the box indicates the variability of the data. The line inside the box denotes the median (sometimes marked as a plus sign). The position of the median indicates whether the data are skewed or not.
If the median is closer to the upper quartile, the data are negatively skewed; if it is nearer the lower quartile, they are positively skewed.
The lines outside the box on either side are known as whiskers [Figure 3]. The whiskers extend up to 1.5 times the length of the box, i.e., the interquartile range (IQR). The end of each whisker marks the inner fence, and any value beyond it is an outlier. If the distribution is symmetrical, the whiskers are of equal length. If the data are sparse on one side, the corresponding whisker will be short. The outer fence (usually not marked) is at a distance of three times the IQR on either side of the box. The reason for placing the inner and outer fences at 1.5 and 3 times the IQR, respectively, is that for normally distributed data, nearly all observations (about 99%) fall within the inner fences, and values beyond the outer fences are extremely rare.

Figure 3
Schematic diagram of a “box and whisker plot”
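The fence arithmetic can be sketched with the standard library (the sample values are made up; note that quartile definitions vary slightly between packages):

```python
import statistics

def box_fences(data):
    """Inner and outer fences of a box plot, per Tukey's rule."""
    # quantiles(n=4) returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(sorted(data), n=4)
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
(inner_lo, inner_hi), _ = box_fences(data)

# Values beyond the inner fences are flagged as outliers.
outliers = [x for x in data if x < inner_lo or x > inner_hi]
```

Here the extreme value 50 lies beyond the inner fence and is flagged, matching the definition in the text.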




CHARACTERISTICS OF FREQUENCY DISTRIBUTION

There are four important characteristics of frequency distribution.
 They are as follows:
  • Measures of central tendency and location (mean, median, mode)
  • Measures of dispersion (range, variance, standard deviation)
  • The extent of symmetry/asymmetry (skewness)
  • The flatness or peakedness (kurtosis).
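Three of these four characteristics can be computed directly with Python's statistics module; skewness needs a small moment formula, sketched here with invented data (kurtosis follows the same pattern with a fourth power):

```python
import statistics

# Invented sample values for illustration.
data = [62, 68, 71, 75, 78, 64, 69, 72, 81, 66, 73, 70]

mean = statistics.mean(data)      # central tendency
median = statistics.median(data)
spread = statistics.pstdev(data)  # dispersion (population SD)

# Skewness: the third standardized moment, 0 for a symmetric
# distribution, negative/positive for left/right skew.
n = len(data)
skewness = sum((x - mean) ** 3 for x in data) / (n * spread ** 3)
```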

Diagrammatic Presentation of Data

 Introduction
Although tabulation is a very good technique to present data, diagrams are a more advanced technique to represent data.
A layman cannot understand tabulated data easily, but with only a single glance at a diagram, one gets a complete picture of the data presented.
According to M.J. Moroney, "diagrams register a meaningful impression almost before we think."






Importance or utility of Diagrams
  • Diagrams give a very clear picture of the data. Even a layman can understand them very easily and in a short time.
  • Comparisons between different samples can be made very easily, without applying any further statistical technique.
  • This technique can be used universally, at any place and at any time, and in almost all subjects and fields.
  • Diagrams also have impressive value. Tabulated data does not make as strong an impression as diagrams, and a common man is easily impressed by a good diagram.
  • This technique can be used for numerical types of statistical analysis, e.g., to locate the mean, mode, median, or other statistical values.
  • It not only saves time and energy but is also economical; not much money is needed to prepare even good diagrams.
  • Diagrams give much more information than tabulation, which has its own limits.
  • Diagrams are easily remembered; they leave a more lasting impression than other data techniques.
  • Data can be condensed with diagrams; a simple diagram can present what even 10,000 words cannot.







General Guidelines for Diagrammatic presentation
  • The diagram should be properly drawn at the outset. The pith and substance of the subject matter must be made clear under a broad heading which properly conveys the purpose of the diagram.
  • The size of the scale should be neither too big nor too small. If it is too big, the diagram may look ugly; if it is too small, it may not convey the meaning. The size of the paper must also be taken note of, as it helps determine the size of the diagram.
  • To clarify any ambiguities, notes should be added at the foot of the diagram. These provide visual insight into the diagram.
  • Diagrams should be absolutely neat and clean, with no vagueness or overwriting.
  • Simplicity means that the diagram should convey its meaning clearly and easily at first sight.
  • The scale must be presented along with the diagram.
  • It must be self-explanatory, indicating the nature, place, and source of the data presented.
  • Different shades and colors can be used to make diagrams more easily understandable.
  • Vertical diagrams should be preferred to horizontal diagrams.
  • It must be accurate. Accuracy must not be sacrificed to make it attractive or impressive.







Limitations of Diagrammatic Presentation
  • Diagrams do not present small differences properly.
  • They can easily be misused.
  • Only an artist can draw multi-dimensional diagrams.
  • In statistical analysis, diagrams are of little use.
  • Diagrams are just a supplement to tabulation.
  • Only a limited set of data can be presented in the form of a diagram.
  • Diagrammatic presentation of data is a more time-consuming process.
  • Diagrams present only preliminary conclusions.
  • Diagrammatic presentation shows only an estimate of the actual behavior of the variables.
Types of Diagrams
(a) Line Diagrams
In these diagrams only a line is drawn to represent one variable. The lines may be vertical or horizontal, and are drawn such that their length is proportional to the value of the terms or items, so that comparisons may be made easily.
(b) Simple Bar Diagram
Like line diagrams, these figures are used where only a single dimension, i.e., length, can present the data.
The procedure is almost the same, except that the bars have a measured thickness. They can be drawn either vertically or horizontally.
The breadth of the bars should be equal, and likewise the distance between them. The breadth and spacing should be chosen according to the space available on the paper.

(c) Multiple Bar Diagrams
This diagram is used when we have to make comparisons between two or more variables. The number of variables may be 2, 3, 4, or more. In the case of 2 variables, pairs of bars are drawn; similarly, in the case of 3 variables, triple bars are drawn.
The bars are drawn on the same proportionate basis as in the case of simple bars. The same shade is given to the same item, and the distance between groups is kept constant.
(d) Sub-divided Bar Diagram
The data which are presented by a multiple bar diagram can also be presented by this diagram. In this case we add the different variables for a period and draw them on a single bar. The components must be kept in the same order in each bar. This diagram is more effective when the number of components is small, i.e., 3 to 5.
(e) Percentage Bar Diagram
Like the sub-divided bar diagram, in this case also the data of one particular period or variable are put on a single bar, but in terms of percentages. Components are kept in the same order in each bar for easy comparison.
(f) Duo-directional Bar Diagram
In this case the diagram extends on both sides of the base line, i.e., to the left and right, or above and below.
(g) Broken Bar Diagram
This diagram is used when the value of some variable is very high or very low compared to the others. In this case the bars of the bigger terms or items may be shown broken.

Graphic Presentation of Data

 Introduction
A graph refers to the plotting of different values of the variables on graph paper, which shows the movement or change in a variable over a period of time.
Diagrams can present data in an attractive style, but there is a method more reliable than this. Diagrams are often used for publicity purposes but are not of much use in statistical analysis. Hence graphic presentation is more effective and result-oriented.
According to A. L. Boddington, "The wandering of a line is more powerful in its effect on the mind than a tabulated statement;
 it shows what is happening and what is likely to take place, just as quickly as the eye is capable of working."






Advantages of Graphs
The presentation of statistics in the form of graphs facilitates many processes in economics. The main uses of graphs are as under:
  • Attractive and effective presentation of data: Statistics can be presented in an attractive and effective way by graphs. A fact that an ordinary man cannot understand easily can be understood better through a graph; therefore it is said that a picture is worth a thousand words.
  • Simple and understandable presentation of data: Graphs help to present complex data in a simple and understandable way, and thus help to remove the complex nature of statistics.
  • Useful in comparison: Graphs also help to compare statistics. If investment made in two different ventures is presented through graphs, it becomes easy to understand the difference between the two.
  • Useful for interpretation: Graphs also help to interpret conclusions, saving time as well as labour.
  • Remembered for a long period: Graphs help the facts to be remembered for a long time.
  • Helpful in predictions: Through graphs, tendencies that could occur in the near future can be predicted in a better way.
  • Universal utility: In the modern era, graphs can be used in all spheres, such as trade, economics, government departments, advertisement, etc.
  • Information as well as entertainment: Graphs provide information and entertainment alike, and present no hindrance to the deep analysis of the information.
  • Helpful in transmission of information: Graphs help in the transmission of facts.
  • No need for training: When facts are presented through graphs, no special training is needed to interpret them.

Rules for the construction of Graph
The following are the main rules to construct a graph:
  • Every graph must have a suitable title which clearly conveys the main idea the graph intends to portray.
  • The graph must suit the size of the paper.
  • The scale of the graph should be in even numbers or in multiples.
  • Footnotes should be given at the bottom to illustrate the main points about the graph.
  • The graph should be as simple as possible.
  • To show many items in a graph, an index for identification should be given.
  • A graph should be neat and clean. It should be appealing to the eye.
  • Every graph should be given with a table, to verify whether the data have been presented accurately.
  • The test of a good graph is the ease with which the observer can interpret it. Economy in cost and energy should also be exercised in drawing the graph.
Limitations
Following are the main drawbacks/ limitations of graphs.
Limited application: Graphic representation is useful for a common man, but for an expert its utility is limited.
Lack of accuracy: Graphs do not measure the magnitude of the data; they only depict the fluctuations in them.
Subjective: Graphs are subjective in character; their interpretation varies from person to person.
Misleading conclusions: A person who has no knowledge of the data can draw misleading conclusions from graphs.
How to choose a scale for a graph?
The scale indicates the unit of a variable that a fixed length of axis represents. The scale may differ between the two axes. It should be chosen so as to accommodate the whole of the data on the given graph paper in a lucid and attractive style. Sometimes the data to be presented contain no low values but only large ones; the scale must still be chosen so that the graph presents the given data and allows comparison.
Types of Graphs
There are two types of graphs.
  • Time series Graphs or Historigrams.
  • Frequency Distribution Graphs.
Time series graphs may be of one variable, two variables, or more. Frequency distribution graphs include (a) histograms, (b) frequency polygons, (c) frequency curves, and (d) ogives.






